I am looking for a more efficient way of applying the code below. It works, but on very large dataframes it becomes quite slow. Is there a more efficient way to do functionally the same thing? The output is just a unique column built from a list of columns, like col1_col2_col3.
df['unique_thread'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
Try with
df['unique_thread'] = df[cols].astype(str).agg('_'.join,1)
Before you try out BENY's answer, I suggest you have a look at these timing results:
In [15]: %timeit a[0] + '_' + a[1]
250 µs ± 5.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: %timeit a.agg('_'.join,1)
675 µs ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Since you already know that you have to operate on 3 columns, you should do something like this:
# assumes the columns already hold strings; otherwise call .astype(str) on each first
df['unique_thread'] = df[cols[0]] + '_' + df[cols[1]] + '_' + df[cols[2]]
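If the number of columns is not fixed at three, the same column-wise concatenation can be written for an arbitrary list. The following is a minimal sketch, assuming a small made-up frame and the column names col1, col2, col3 from the question; it uses Series.str.cat and converts everything to strings first.

```python
import pandas as pd

# Made-up sample data; the column names follow the question's col1_col2_col3 pattern.
df = pd.DataFrame({"col1": ["a", "b"], "col2": [1, 2], "col3": ["x", "y"]})
cols = ["col1", "col2", "col3"]

# Convert once, then join the remaining columns onto the first one.
as_str = df[cols].astype(str)
df["unique_thread"] = as_str[cols[0]].str.cat(
    [as_str[c] for c in cols[1:]], sep="_"
)
print(df["unique_thread"].tolist())
```

Whether this beats the explicit `+` chain on a given dataset is something to time on your own data.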
Related
I have a pandas dataframe with two columns, where I need to check whether the value at each row of column A is a string that starts with the value of the corresponding row of column B, or vice versa.
It seems that the Series method .str.startswith cannot deal with vectorized input, so I needed to zip over the two columns in a list comprehension and create a new pd.Series with the same index as any of the two columns.
I would like this to be a vectorized operation with the .str accessor available to operate on iterables, but something like this returns NaN:
df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df['a'].str.startswith(df['b'])
while my working solution is the following:
pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
I suspect that there may be a better way to tackle this issue as it also would benefit all string methods on series.
Is there any more beautiful or efficient method to do this?
One idea is to use np.vectorize, but because it is working with strings, performance is only a bit better than your solution:
def fun(a, b):
    return a.startswith(b) or b.startswith(a)
f = np.vectorize(fun)
a = pd.Series(f(df['a'],df['b']), index=df.index)
print (a)
0 True
1 False
dtype: bool
df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df = pd.concat([df] * 10000, ignore_index=True)
In [132]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in df[['a', 'b']].to_numpy()])
42.3 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [133]: %timeit pd.Series(f(df['a'],df['b']), index=df.index)
9.81 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
14.1 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# @sammywemmy's solution
In [135]: %timeit pd.Series([any((a.startswith(b), b.startswith(a))) for a, b in df.to_numpy()], index=df.index)
46.3 ms ± 683 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
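Before comparing timings it is worth confirming that the variants agree. This is a small sketch on the two-row example frame; the helper name mutual_prefix is made up for illustration, and otypes=[bool] is passed so the output dtype does not depend on the first element.

```python
import numpy as np
import pandas as pd

def mutual_prefix(a: str, b: str) -> bool:
    # True when either string starts with the other.
    return a.startswith(b) or b.startswith(a)

df = pd.DataFrame(data={"a": ["x", "yy"], "b": ["xyz", "uvw"]})

# Plain Python loop over the zipped columns.
via_zip = pd.Series(
    [mutual_prefix(a, b) for a, b in zip(df["a"], df["b"])], index=df.index
)

# np.vectorize wraps the same Python function; it is still called per element.
vectorized_fun = np.vectorize(mutual_prefix, otypes=[bool])
via_vectorize = pd.Series(vectorized_fun(df["a"], df["b"]), index=df.index)

print(via_zip.equals(via_vectorize))
```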
Data
pb = {"mark_up_id":{"0":"123","1":"456","2":"789","3":"111","4":"222"},"mark_up":{"0":1.2987,"1":1.5625,"2":1.3698,"3":1.3333,"4":1.4589}}
data = {"id":{"0":"K69","1":"K70","2":"K71","3":"K72","4":"K73","5":"K74","6":"K75","7":"K79","8":"K86","9":"K100"},"cost":{"0":29.74,"1":9.42,"2":9.42,"3":9.42,"4":9.48,"5":9.48,"6":24.36,"7":5.16,"8":9.8,"9":3.28},"mark_up_id":{"0":"123","1":"456","2":"789","3":"111","4":"222","5":"333","6":"444","7":"555","8":"666","9":"777"}}
pb = pd.DataFrame(data=pb).set_index('mark_up_id')
df = pd.DataFrame(data=data)
Expected Output
test = df.join(pb, on='mark_up_id', how='left')
test['cost'].update(test['cost'] + test['mark_up'])
test.drop('mark_up',axis=1,inplace=True)
Or..
df['cost'].update(df['mark_up_id'].map(pb['mark_up']) + df['cost'])
Question
Is there a function that does the above, or is this the best way to go about this type of operation?
I would use the second solution you propose, or better, this:
df['cost']=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost'])
I think update can be awkward to use because it does not return anything; Series.fillna is more flexible.
We can also use DataFrame.assign in order to continue working on the DataFrame that the assignment returns.
df.assign( Cost=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost']) )
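To make the map + fillna behaviour concrete, here is a minimal sketch on a cut-down, made-up version of the data (three rows, one of which has no matching mark_up_id). Unmatched rows map to NaN, and fillna restores their original cost.

```python
import pandas as pd

# Reduced, illustrative data: id "999" has no entry in pb.
pb = pd.DataFrame(
    {"mark_up_id": ["123", "456"], "mark_up": [1.5, 2.0]}
).set_index("mark_up_id")
df = pd.DataFrame(
    {
        "id": ["K69", "K70", "K71"],
        "cost": [10.0, 20.0, 30.0],
        "mark_up_id": ["123", "456", "999"],
    }
)

# Look up the mark-up per row; rows without a match become NaN.
marked_up = df["mark_up_id"].map(pb["mark_up"]) + df["cost"]

# assign returns a new frame, so the original df is left untouched.
result = df.assign(cost=marked_up.fillna(df["cost"]))
print(result)
```

Because assign returns a new DataFrame, the call can be followed by further chained methods.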
Time comparison with the join method
%%timeit
df['cost']=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost'])
#945 µs ± 46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
test = df.join(pb, on='mark_up_id', how='left')
test['cost'].update(test['cost'] + test['mark_up'])
test.drop('mark_up',axis=1,inplace=True)
#3.59 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
slow..
%%timeit
df['cost'].update(df['mark_up_id'].map(pb['mark_up']) + df['cost'])
#985 µs ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Finally, I recommend you see: Understanding inplace and When I should use apply
I know there is more than one way to approach this and get the job done. Are there any considerations other than performance when choosing whether to use Apply Lambda? I have a particularly large dataframe with a column of emails, and I need to strip the '#domain' from all of them. There is the simple:
DF['PRINCIPAL'] = DF['PRINCIPAL'].str.split("#", expand=True)[0]
and then the Apply Lambda:
DF['PRINCIPAL'] = DF['PRINCIPAL'].apply(lambda x: x.split("#")[0])
I assume they are roughly equivalent, but their method of execution will mean they are each more efficient in certain situations. Is there anything I should know?
Use:
df = pd.DataFrame({'email':['abc#ABC.com']*1000})
s1 = df['email'].str.split('#').str[0]
s2 = pd.Series([i.split('#')[0] for i in df['email']], name='email')
s1.eq(s2).all()
Output
True
Timings:
%timeit s1 = df['email'].str.split('#').str[0]
1.77 ms ± 75.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit s2 = pd.Series([i.split('#')[0] for i in df['email']], name='email')
737 µs ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can use assign, which is the method recommended by Marc Garcia in his talk toward pandas 1.0, because it lets you chain operations on the same dataframe (see the example between 6:17 and 7:30):
DF = DF.assign(PRINCIPAL=lambda x: x['PRINCIPAL'].str.split("#", expand=True)[0])
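One consideration other than speed is how missing values are handled. The sketch below uses a made-up Series with one missing entry: the .str accessor passes the missing value through, while a bare list comprehension raises unless it is guarded.

```python
import pandas as pd

# Illustrative data with one missing email.
emails = pd.Series(["abc#ABC.com", None, "def#XYZ.org"])

# The .str accessor propagates missing values instead of failing.
vectorized = emails.str.split("#").str[0]

# An unguarded comprehension fails on the missing entry.
try:
    [e.split("#")[0] for e in emails]
    comprehension_failed = False
except AttributeError:
    comprehension_failed = True

# A guarded comprehension needs an explicit type check.
guarded = [e.split("#")[0] if isinstance(e, str) else e for e in emails]

print(vectorized.tolist(), comprehension_failed)
```

So the faster list comprehension is only a drop-in replacement when the column is known to contain strings everywhere.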
List comprehensions are much faster than a normal for loop. The reason usually given is that a list comprehension does not need to call append, which is understandable.
But I have read in various places that list comprehensions are also faster than apply, and I have experienced that as well. What I cannot understand is what internal mechanism makes them faster than apply.
I know this has something to do with vectorization in numpy, which is the base implementation of pandas dataframes. But why list comprehensions beat apply is not clear to me, since in a list comprehension we write an explicit for loop, whereas with apply we do not write any loop (and I assume vectorization takes place there as well).
Edit:
adding code:
this is working on titanic dataset, where title is extracted from name:
https://www.kaggle.com/c/titanic/data
%timeit train['NameTitle'] = train['Name'].apply(lambda x: 'Mrs.' if 'Mrs' in x else \
('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else\
('Master' if 'Master' in x else 'None'))))
%timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else 'Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else ('Master' if 'Master' in x else 'None')) for x in train['Name']]
Result:
782 µs ± 6.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
499 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Edit2:
To add code for SO, I wrote a simple example, and surprisingly, for the code below, the results reverse:
import pandas as pd
import timeit
df_test = pd.DataFrame()
tlist = []
tlist2 = []
for i in range(0, 5000000):
    tlist.append(i)
    tlist2.append(i+5)
df_test['A'] = tlist
df_test['B'] = tlist2
display(df_test.head(5))
%timeit df_test['C'] = df_test['B'].apply(lambda x: x*2 if x%5==0 else x)
display(df_test.head(5))
%timeit df_test['C'] = [ x*2 if x%5==0 else x for x in df_test['B']]
display(df_test.head(5))
1 loop, best of 3: 2.14 s per loop
1 loop, best of 3: 2.24 s per loop
Edit3:
Some have suggested that apply is essentially a for loop. That does not seem to be the case: when I ran this code with a for loop, it almost never ended. I had to stop it manually after 3-4 minutes, and it had not completed in that time:
for row in df_test.itertuples():
    x = row.B
    if x%5==0:
        df_test.at[row.Index,'B'] = x*2
Running the above code takes around 23 seconds, but apply takes only 1.8 seconds. So what is the difference between an explicit loop over itertuples and apply?
There are a few reasons for the performance difference between apply and list comprehension.
First of all, list comprehension in your code doesn't make a function call on each iteration, while apply does. This makes a huge difference:
map_function = lambda x: 'Mrs.' if 'Mrs' in x else \
('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
('Master' if 'Master' in x else 'None')))
%timeit train['NameTitle'] = [map_function(x) for x in train['Name']]
# 581 µs ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else \
('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
('Master' if 'Master' in x else 'None'))) for x in train['Name']]
# 482 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Secondly, apply does much more than a list comprehension. For example, it tries to find an appropriate dtype for the result. By disabling that behaviour you can see what impact it has:
%timeit train['NameTitle'] = train['Name'].apply(map_function)
# 660 µs ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['NameTitle'] = train['Name'].apply(map_function, convert_dtype=False)
# 626 µs ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
There's also a bunch of other stuff happening within apply, so in this example you would want to use map:
%timeit train['NameTitle'] = train['Name'].map(map_function)
# 545 µs ± 4.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This performs better than the list comprehension with a function call in it.
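A different technique from the ones timed above is to avoid the per-row Python function entirely and express the cascade with numpy.select over boolean masks. This is a sketch under assumptions: the five names are illustrative strings in the Titanic "Surname, Title. Given" style rather than the real train frame, and no timing is claimed for it.

```python
import numpy as np
import pandas as pd

# Illustrative names only; not loaded from the Kaggle dataset.
names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
        "Uruchurtu, Don. Manuel E",
    ]
)

# Same order as the nested conditional: the first matching condition wins,
# so "Mrs" must be tested before "Mr".
substrings = ["Mrs", "Mr", "Miss", "Master"]
choices = ["Mrs.", "Mr", "Miss", "Master"]
conditions = [
    names.str.contains(s, regex=False).to_numpy(dtype=bool) for s in substrings
]

titles = np.select(conditions, choices, default="None")
print(titles.tolist())
```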
Then why use apply at all, you might ask? I know at least one case where it outperforms everything else: when the operation you want to apply is a vectorized universal function. That's because apply, unlike map and a list comprehension, allows the function to run on the whole Series instead of on the individual objects in it. Let's see an example:
%timeit train['AgeExp'] = train['Age'].apply(lambda x: np.exp(x))
# 1.44 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = train['Age'].apply(np.exp)
# 256 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = train['Age'].map(np.exp)
# 1.01 ms ± 8.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = [np.exp(x) for x in train['Age']]
# 1.21 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
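The same idea, operating on the whole column at once, also applies to the x*2 if x%5==0 example from the question. The sketch below uses a ten-row stand-in for df_test and numpy.where; it checks that the result matches the apply version but makes no timing claim.

```python
import numpy as np
import pandas as pd

# Small stand-in for the 5,000,000-row df_test from the question.
df_test = pd.DataFrame({"B": np.arange(5, 15)})

# Row-by-row version, as in the question.
looped = df_test["B"].apply(lambda x: x * 2 if x % 5 == 0 else x)

# Whole-column version: the condition and both branches are array operations.
vectorized = np.where(df_test["B"] % 5 == 0, df_test["B"] * 2, df_test["B"])

df_test["C"] = vectorized
print(df_test["C"].tolist())
```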
This is how I currently do it:
# Turn all table elements to strings
df = df.astype(str)
df.columns = df.columns.map(str)
df.index = df.index.map(str)
Is there a one-liner that will turn the df data, columns and indices to strings?
Update
Out of curiosity I timed the various answers.
My method: 909 µs ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
@Wen's method: 749 µs ± 41.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
@COLDSPEED's method: 732 ns ± 44.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Hence the accepted answer.
This isn't a bad question at all. Well, there's the obvious astype solution by @Wen. There are a couple of innovative solutions as well.
Let's try something a bit more interesting with operator.methodcaller.
from operator import methodcaller
df, df.columns, df.index = map(
    methodcaller('astype', str), (df, df.columns, df.index)
)
Since you mentioned a one liner, you can recreate your df.
new_df = pd.DataFrame(
    data=df.values.astype(str),
    columns=df.columns.astype(str),
    index=df.index.astype(str)
)
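If a single call is the goal, the three steps can also be wrapped in a small helper. This is a sketch; the function name stringify and the sample frame are made up for illustration.

```python
import pandas as pd

def stringify(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame whose data, columns and index are all strings."""
    out = frame.astype(str)  # converts each column separately
    out.columns = out.columns.astype(str)
    out.index = out.index.astype(str)
    return out

# Illustrative frame with integer labels and mixed int/float columns.
df = pd.DataFrame({1: [10, 20], 2: [1.5, 2.5]}, index=[100, 200])
new_df = stringify(df)
print(new_df.loc["100", "1"], new_df.loc["200", "2"])
```

One difference to be aware of: DataFrame.astype(str) converts column by column, whereas df.values first builds a single array with a common dtype, so on a frame mixing integer and float columns the integers may be rendered as floats (for example "10.0") in the df.values variant.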