I know there is more than one way to approach this and get the job done. Are there any considerations other than performance when choosing whether to use apply with a lambda? I have a particularly large dataframe with a column of emails, and I need to strip the '#domain' from all of them. There is the simple:
DF['PRINCIPAL'] = DF['PRINCIPAL'].str.split("#", expand=True)[0]
and then the Apply Lambda:
DF['PRINCIPAL'] = DF['PRINCIPAL'].apply(lambda x: x.split("#")[0])
I assume they are roughly equivalent, but their method of execution will mean each is more efficient in certain situations. Is there anything I should know?
Use:
df = pd.DataFrame({'email':['abc#ABC.com']*1000})
s1 = df['email'].str.split('#').str[0]
s2 = pd.Series([i.split('#')[0] for i in df['email']], name='email')
s1.eq(s2).all()
Output
True
Timings:
%timeit s1 = df['email'].str.split('#').str[0]
1.77 ms ± 75.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit s2 = pd.Series([i.split('#')[0] for i in df['email']], name='email')
737 µs ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can use assign, which is the method recommended by Marc Garcia in his talk toward pandas 1.0, because it lets you chain operations on the same dataframe (see the example between 6:17 and 7:30):
DF = DF.assign(PRINCIPAL=lambda x: x['PRINCIPAL'].str.split("#", expand=True)[0])
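As a rough illustration of the chaining point, the sketch below strips the domain and then derives a second column in the same expression. The sample rows and the PRINCIPAL_LEN column are made up for the example; only the PRINCIPAL column and the '#' separator come from the question.

```python
import pandas as pd

# Hypothetical sample data; the '#' separator follows the question.
DF = pd.DataFrame({'PRINCIPAL': ['alice#example.com', 'bob#example.org']})

result = (
    DF
    # Keep only the part before '#'.
    .assign(PRINCIPAL=lambda x: x['PRINCIPAL'].str.split('#').str[0])
    # This lambda receives the frame produced by the previous step,
    # so it already sees the stripped values.
    .assign(PRINCIPAL_LEN=lambda x: x['PRINCIPAL'].str.len())
)
print(result)
```

Because assign returns a new frame, DF itself is left untouched unless you rebind it.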
Related
I need a way to vectorize the eval statement that loops through the reference dataframe (df_ref) that has the strings that iloc to the source df (df)
Here's a reprex of the problem:
import pandas as pd
dataValues = ['a','b','c']
df = pd.DataFrame({'values': dataValues})
df_list = ['df.iloc[0,0]','df.iloc[2,0]']
df_ref = pd.DataFrame({'ref':df_list})
# looping 10000 times just to replicate the number of times this operation
# may run in a typical scenario
def help():
    for i in range(10000):
        for index, row in df_ref.iterrows():
            eval(df_ref.ref[index])
%timeit help()
Outputs:
2.84 s ± 366 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Before you respond with an alternative: my problem will always have some dynamic referencing that I have to replicate in Python, so a more straightforward route may not solve my particular problem.
Thanks for the help!
As commented, extract the indexes first, then slice into the whole thing:
def help1():
    # raw string so the regex backslashes are not treated as string escapes
    indexes = df_ref['ref'].str.extract(r'iloc\[(\d+),\s*(\d+)\]').astype(int)
    return df.to_numpy()[indexes[0], indexes[1]]
and
help1()
> array(['a', 'c'], dtype=object)
%timeit -n 1000 help1()
> 1.11 ms ± 89.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
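A self-contained check, assuming every reference has the exact form df.iloc[i,j] with integer literals (anything more dynamic would need a different parser): the sketch compares the extracted-index lookup against an eval loop on the question's data.

```python
import pandas as pd

df = pd.DataFrame({'values': ['a', 'b', 'c']})
df_ref = pd.DataFrame({'ref': ['df.iloc[0,0]', 'df.iloc[2,0]']})

# Parse the row and column positions out of each reference string.
indexes = df_ref['ref'].str.extract(r'iloc\[(\d+),\s*(\d+)\]').astype(int)

# One NumPy fancy-indexing call replaces the per-row eval.
fast = df.to_numpy()[indexes[0].to_numpy(), indexes[1].to_numpy()]

# Reference result from the eval-based approach, with an explicit namespace.
slow = [eval(ref, {'df': df}) for ref in df_ref['ref']]

print(list(fast), slow)
```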
I am looking for a more efficient way of applying the code below. It works, but it becomes quite slow on very large dataframes. Is there a more efficient way to do the same thing? The output is a unique column built from a list of columns, like col1_col2_col3.
df['unique_thread'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
Try with
df['unique_thread'] = df[cols].astype(str).agg('_'.join,1)
Before you try out BENY's answer, I suggest you look at these timing results:
In [15]: %timeit a[0] + '_' + a[1]
250 µs ± 5.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: %timeit a.agg('_'.join,1)
675 µs ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Since you already know that you have to operate on 3 columns, you should do something like this:
df['unique_thread'] = df[cols[0]] + '_' + df[cols[1]] + '_' + df[cols[2]]
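If the number of columns is not fixed, one possible middle ground is Series.str.cat, which joins a list of columns without a row-wise apply. This is only a sketch on made-up data: the column names are placeholders, and the astype(str) call is needed only when some columns are not already strings. No timing is claimed here, so measure on your own data.

```python
import pandas as pd

# Hypothetical frame with one non-string column.
df = pd.DataFrame({'col1': ['a', 'b'], 'col2': [1, 2], 'col3': ['x', 'y']})
cols = ['col1', 'col2', 'col3']

# Convert once, then concatenate the remaining columns onto the first.
as_str = df[cols].astype(str)
df['unique_thread'] = as_str[cols[0]].str.cat(as_str[cols[1:]], sep='_')

# Same result as the row-wise version from the question.
expected = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
print(df['unique_thread'].tolist())
```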
I have a pandas dataframe with two columns, and I need to check whether the value in each row of column A is a string that starts with the value in the corresponding row of column B, or vice versa.
It seems that the Series method .str.startswith cannot deal with vectorized input, so I needed to zip over the two columns in a list comprehension and create a new pd.Series with the same index as any of the two columns.
I would like this to be a vectorized operation with the .str accessor available to operate on iterables, but something like this returns NaN:
df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df['a'].str.startswith(df['b'])
while my working solution is the following:
pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
I suspect that there may be a better way to tackle this issue as it also would benefit all string methods on series.
Is there any more beautiful or efficient method to do this?
One idea is to use np.vectorize, but because it is working with strings, performance is only a bit better than your solution:
import numpy as np
def fun(a, b):
    return a.startswith(b) or b.startswith(a)
f = np.vectorize(fun)
a = pd.Series(f(df['a'],df['b']), index=df.index)
print (a)
0 True
1 False
dtype: bool
df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df = pd.concat([df] * 10000, ignore_index=True)
In [132]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in df[['a', 'b']].to_numpy()])
42.3 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [133]: %timeit pd.Series(f(df['a'],df['b']), index=df.index)
9.81 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
14.1 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#sammywemmy solution
In [135]: %timeit pd.Series([any((a.startswith(b), b.startswith(a))) for a, b in df.to_numpy()], index=df.index)
46.3 ms ± 683 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
With this pycon talk as a source.
def clean_string(item):
    if type(item)==type(1):
        return item
    else:
        return np.nan
My dataframe has a column containing both numerical and string data. I want to change the strings to np.nan
while leaving the numerical data as it is.
This approach is working fine
df['Energy Supply'].apply(clean_string)
but when I try to use vectorisation, every value in the column becomes np.nan:
df['Energy Supply'] = clean_string(df['Energy Supply']) # vectorisation
I believe this is because type(item) inside the clean_string function is pd.Series rather than the type of each element.
Is there a way to overcome this problem?
PS: I am a beginner in pandas
Vectorizing an operation in pandas isn't always possible. I'm not aware of a pandas built-in vectorized way to get the type of the elements in a Series, so your .apply() solution may be the best approach.
The reason that your code doesn't work in the second case is that you are passing the entire Series to your clean_string() function. It compares the type of the Series to type(1), which is False, so it returns the single value np.nan. Pandas automatically broadcasts this value when assigning it back to the df, so you get a column of NaN. To avoid this, you would have to loop over all of the elements in the Series inside your clean_string() function.
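The behaviour described above can be reproduced in a few lines. The column contents here are invented for the illustration; clean_string is the function from the question.

```python
import numpy as np
import pandas as pd

def clean_string(item):
    if type(item) == type(1):
        return item
    else:
        return np.nan

# Hypothetical column mixing integers and a string.
df = pd.DataFrame({'Energy Supply': [1, 'n/a', 3]})

# Called on the whole Series: the type check sees a Series, not an int,
# so the function returns a single scalar NaN.
whole = clean_string(df['Energy Supply'])
print(type(whole))

# Called through .apply(): the function runs once per element.
per_element = df['Energy Supply'].apply(clean_string)
print(per_element.tolist())

# Assigning the scalar broadcasts it to every row.
df['Energy Supply'] = whole
print(df['Energy Supply'].isna().all())
```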
Out of curiosity, I tried a few other approaches to see if any of them would be faster than your version. To test, I created 10,000 and 100,000 element pd.Series with alternating integer and string values:
import numpy as np
import pandas as pd
s = pd.Series(i if i%2==0 else str(i) for i in range(10000))
s2 = pd.Series(i if i%2==0 else str(i) for i in range(100000))
These tests are done using pandas 1.0.3 and python 3.8.
Baseline using clean_string()
In []: %timeit s.apply(clean_string)
3.75 ms ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In []: %timeit s2.apply(clean_string)
39.5 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Series.str methods
An alternative way to test for strings vs. non-strings would be to use the built-in .str functions on the Series. For example, .str.len() returns NaN for any non-strings in the Series. These are even called "Vectorized String Methods" in the pandas documentation, so maybe they will be more efficient.
In []: %timeit s.mask(s.str.len()>0)
6 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In []: %timeit s2.mask(s2.str.len()>0)
56.8 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Unfortunately, this approach is slower than the .apply(). Despite being "vectorized" it doesn't look like this is a better approach. It is also not quite identical to the logic of clean_string() because it is testing for elements that are strings not for elements that are integers.
Applying type directly to the Series
Based on this answer, I decided to test using .apply() with type to get the type of each element. Once we know the type, compare to int and use the .mask() method to convert any non-integers to NaN.
In []: %timeit s.mask(s.apply(type)!=int)
1.88 ms ± 4.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In []: %timeit s2.mask(s2.apply(type)!=int)
15.2 ms ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This turns out to be the fastest approach that I've found.
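For readers who want to see what the mask produces, here is a small sketch on a four-element Series (the values are made up). It makes no timing claim, only shows the mechanics.

```python
import pandas as pd

# Hypothetical Series alternating integers and strings.
s = pd.Series([0, '1', 2, '3'])

# Element-wise type lookup; the result is a Series of Python type objects.
types = s.apply(type)

# mask() replaces values with NaN wherever the condition is True,
# here wherever the element's exact type is not int.
cleaned = s.mask(types != int)
print(cleaned.tolist())
```

Because the comparison is on the exact type, floats and bools would be masked as well. If those should survive, a per-element check such as isinstance(v, (int, float)) inside .apply() is one option.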
Data
pb = {"mark_up_id":{"0":"123","1":"456","2":"789","3":"111","4":"222"},"mark_up":{"0":1.2987,"1":1.5625,"2":1.3698,"3":1.3333,"4":1.4589}}
data = {"id":{"0":"K69","1":"K70","2":"K71","3":"K72","4":"K73","5":"K74","6":"K75","7":"K79","8":"K86","9":"K100"},"cost":{"0":29.74,"1":9.42,"2":9.42,"3":9.42,"4":9.48,"5":9.48,"6":24.36,"7":5.16,"8":9.8,"9":3.28},"mark_up_id":{"0":"123","1":"456","2":"789","3":"111","4":"222","5":"333","6":"444","7":"555","8":"666","9":"777"}}
pb = pd.DataFrame(data=pb).set_index('mark_up_id')
df = pd.DataFrame(data=data)
Expected Output
test = df.join(pb, on='mark_up_id', how='left')
test['cost'].update(test['cost'] + test['mark_up'])
test.drop('mark_up',axis=1,inplace=True)
Or:
df['cost'].update(df['mark_up_id'].map(pb['mark_up']) + df['cost'])
Question
Is there a function that does the above, or is this the best way to go about this type of operation?
I would use the second solution you propose or, better, this:
df['cost']=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost'])
I think update can be awkward to use because it modifies the Series in place and returns None.
Series.fillna is more flexible because it returns a new Series that you can assign or chain.
We can also use DataFrame.assign
in order to continue working on the DataFrame that the assignment returns.
df.assign(cost=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost']))
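To make the map-then-fillna step concrete, here is a minimal sketch on two made-up rows, one whose mark_up_id exists in pb and one whose id does not. The values are invented; the column names follow the question.

```python
import pandas as pd

# Hypothetical lookup table with a single known id.
pb = pd.DataFrame({'mark_up': [1.5]},
                  index=pd.Index(['123'], name='mark_up_id'))
df = pd.DataFrame({'id': ['K69', 'K74'],
                   'cost': [10.0, 20.0],
                   'mark_up_id': ['123', '333']})

# map() looks each id up in pb; ids that are missing there become NaN.
mark_up = df['mark_up_id'].map(pb['mark_up'])

# NaN + cost is NaN, so fillna restores the original cost for those rows.
out = df.assign(cost=(mark_up + df['cost']).fillna(df['cost']))
print(out['cost'].tolist())
```

The original df is not modified here, since assign returns a new frame.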
Time comparison with the join method
%%timeit
df['cost']=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost'])
#945 µs ± 46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
test = df.join(pb, on='mark_up_id', how='left')
test['cost'].update(test['cost'] + test['mark_up'])
test.drop('mark_up',axis=1,inplace=True)
#3.59 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
slow..
%%timeit
df['cost'].update(df['mark_up_id'].map(pb['mark_up']) + df['cost'])
#985 µs ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Finally, I recommend you see: Understanding inplace and When I should use apply