Python: speed up checking whether elements of one list exist in another - python

I have made a script where I check whether a column's values from dataframe A exist in a column of dataframe B. Here, dataframe A is named whole_data and dataframe B is named referrals:
users = set(whole_data['user_id'])
referees = set(referrals['referee_id'])
non_referees = set([x for x in users if x not in referees])
As you can see, I want a set of users (named non_referees) containing the users that are not referees; that's why I check, for every user_id from whole_data, whether it exists in the set of referees.
Nonetheless, this is taking a massive amount of time; there are around 100K users and 4K referees. Is there a way to make this faster?

First, pandas can already give you the unique values of a series, which might be faster than building the set from the whole column.
Second, to build the set of non-referees, you can then use set operations:
non_referees = users - referees
EDIT: As an additional note, if you build a set using the generator expression style, you don't need to build an intermediate list:
# slow because it first builds a list and then turns that into a set:
some_set = set([x for x in something])
# faster because it goes right into building the set:
some_other_set = set(x for x in something_else)
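For completeness, a set comprehension is the most direct form of all: it builds the set in a single pass with no intermediate list.
# most direct: a set comprehension
yet_another_set = {x for x in something_else}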

You may want to consider:
non_referees = set(whole_data['user_id'].unique()).difference(
    referrals['referee_id'].unique()
)
If you think there are few repeats in referrals['referee_id'], then you'll gain a smidgen of speed by avoiding .unique() for them.
Speed
Here are some experiments with a few closely related forms:
Case with lots of duplicated referee_id
import numpy as np
import pandas as pd

n = 100_000
whole_data = pd.DataFrame({
    'user_id': np.random.randint(0, n, n),
})
referrals = pd.DataFrame({
    'referee_id': np.random.randint(0, n, n),
})
Measurements:
%timeit non_referees = set(pd.unique(whole_data['user_id'])) - set(pd.unique(referrals['referee_id']))
# 23.4 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()) - set(referrals['referee_id'].unique())
# 23.3 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(set(referrals['referee_id'].unique()))
# 23.3 ms ± 73.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# *** fastest
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'].unique())
# 21.4 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'])
# 29.6 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Case with few duplicates in referee_id:
n = 100_000
whole_data = pd.DataFrame({
    'user_id': np.random.randint(0, n, n),
})
referrals = pd.DataFrame({
    'referee_id': np.random.randint(0, 100*n, n),
})
Measurements:
%timeit non_referees = set(pd.unique(whole_data['user_id'])) - set(pd.unique(referrals['referee_id']))
# 30.7 ms ± 61.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()) - set(referrals['referee_id'].unique())
# 30.7 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(set(referrals['referee_id'].unique()))
# 30.6 ms ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'].unique())
# 23.7 ms ± 37.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# *** fastest
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'])
# 20.9 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
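For reference, the same result can also be obtained by filtering the column with isin instead of building the referee set by hand; this variant was not timed above:
non_referees = set(
    whole_data.loc[~whole_data['user_id'].isin(referrals['referee_id']), 'user_id'].unique()
)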

Related

What's the fastest way to select all rows in one pandas dataframe that do not exist in another?

Beginning with two pandas dataframes of different shapes, what is the fastest way to select all rows in one dataframe that do not exist in the other (or, equivalently, to drop all rows in one dataframe that already exist in the other)? And are the fastest methods different for string-valued columns vs. numeric columns? The operation should be roughly equivalent to the code below:
import pandas as pd

string_df1 = pd.DataFrame({'latin': ['a', 'b', 'c'],
                           'greek': ['alpha', 'beta', 'gamma']})
string_df2 = pd.DataFrame({'latin': ['z', 'c'],
                           'greek': ['omega', 'gamma']})
numeric_df1 = pd.DataFrame({'A': [1, 2, 3],
                            'B': [1.01, 2.02, 3.03]})
numeric_df2 = pd.DataFrame({'A': [3, 9],
                            'B': [3.03, 9.09]})

def index_matching_rows(df1, df2, cols_to_match=None):
    '''
    Return the index of the subset of rows of df1 that are equal to at least one row in df2.
    '''
    if cols_to_match is None:
        cols_to_match = df1.columns
    df1 = df1.reset_index()
    m = df1.merge(df2, on=cols_to_match[0], suffixes=('1', '2'))
    query = '&'.join(['{0}1 == {0}2'.format(str(c)) for c in cols_to_match[1:]])
    m = m.query(query)
    return m['index']

print(string_df2.drop(index_matching_rows(string_df2, string_df1)))
print(numeric_df2.drop(index_matching_rows(numeric_df2, numeric_df1)))
Output:
  latin  greek
0     z  omega

   A     B
1  9  9.09
Some naive performance testing:
copies = 10
big_sdf1 = pd.concat([string_df1, string_df1]*copies)
big_sdf2 = pd.concat([string_df2, string_df2]*copies)
big_ndf1 = pd.concat([numeric_df1, numeric_df1]*copies)
big_ndf2 = pd.concat([numeric_df2, numeric_df2]*copies)
%%timeit
big_sdf2.drop(index_matching_rows(big_sdf2, big_sdf1))
# copies = 10: 2.61 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 20: 4.44 ms ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 30: 18.4 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 40: 74.6 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 100: 19.2 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
big_ndf2.drop(index_matching_rows(big_ndf2, big_ndf1))
# copies = 10: 2.56 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 20: 4.38 ms ± 75.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 30: 18.3 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 40: 76.5 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This code runs about as quickly for strings as for numeric data, and I think its runtime grows exponentially with the length of the dataframe (an exponential fit to the string timings above gives roughly 1.6*exp(0.094x)). I'm working with dataframes that are on the order of 1e5 rows, so this is not a solution for me.
Here's the same performance check for Raymond Kwok's (accepted) answer below in case anyone can beat it later. It's O(n).
%%timeit
big_sdf1_tuples = big_sdf1.apply(tuple, axis=1)
big_sdf2_tuples = big_sdf2.apply(tuple, axis=1)
big_sdf2_tuples.isin(big_sdf1_tuples)
# copies = 100: 4.82 ms ± 22 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1000: 44.6 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 1e4: 450 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# copies = 1e5: 4.42 s ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
big_ndf1_tuples = big_ndf1.apply(tuple, axis=1)
big_ndf2_tuples = big_ndf2.apply(tuple, axis=1)
big_ndf2_tuples.isin(big_ndf1_tuples)
# copies = 100: 4.98 ms ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1000: 47 ms ± 288 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 1e4: 461 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# copies = 1e5: 4.58 s ± 30.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Indexing into the longer dataframe with
big_sdf2_tuples.loc[~big_sdf2_tuples.isin(big_sdf1_tuples)]
to recover the equivalent of the output in my code above adds about 10 ms.
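Putting the pieces together, the full pattern from that answer looks roughly like this:
big_sdf1_tuples = big_sdf1.apply(tuple, axis=1)
big_sdf2_tuples = big_sdf2.apply(tuple, axis=1)
# rows of big_sdf2 that do not appear anywhere in big_sdf1
result = big_sdf2.loc[~big_sdf2_tuples.isin(big_sdf1_tuples)]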
Beginning with 2 dataframes:
df1 = pd.DataFrame({'Runner': ['A', 'A', 'A', 'A'],
                    'Day': ['1', '3', '8', '9'],
                    'Miles': ['3', '4', '4', '2']})
df2 = df1.copy().drop([1, 3])
where the second has two rows fewer.
We can hash the rows:
df1_hashed = df1.apply(tuple, axis=1).apply(hash)
df2_hashed = df2.apply(tuple, axis=1).apply(hash)
and trust, as most people will, that two different rows are extremely unlikely to get the same hash value,
and get rows from df1 that do not exist in df2:
df1[~df1_hashed.isin(df2_hashed)]
  Runner Day Miles
1      A   3     4
3      A   9     2
As for the speed difference between string/integers, I am sure you can test it with your real data.
Note 1: you may actually remove .apply(hash) from both lines (see the sketch below).
Note 2: check out the answer to this question for more on isin and the use of hash.
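A minimal sketch of Note 1, reusing the example dataframes above: tuples are hashable themselves, so isin can work on them directly.
# same anti-join, letting isin hash the row tuples internally
df1_tuples = df1.apply(tuple, axis=1)
df2_tuples = df2.apply(tuple, axis=1)
df1[~df1_tuples.isin(df2_tuples)]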
pandas has a built-in hashing utility that's more than an order of magnitude faster than series of tuples:
%%timeit
big_sdf1_hashed = pd.util.hash_pandas_object(big_sdf1)
big_sdf2_hashed = pd.util.hash_pandas_object(big_sdf2)
big_sdf1.loc[~big_sdf1_hashed.isin(big_sdf2_hashed)]
# copies = 100: 1.05 ms ± 9.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies = 1000: 1.99 ms ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1e4: 10.5 ms ± 47.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1e5: 126 ms ± 747 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 1e7: 14.1 s ± 78.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
big_ndf1_hashed = pd.util.hash_pandas_object(big_ndf1)
big_ndf2_hashed = pd.util.hash_pandas_object(big_ndf2)
big_ndf1.loc[~big_ndf1_hashed.isin(big_ndf2_hashed)]
# copies = 100: 496 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies = 1000: 772 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies = 1e4: 3.88 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1e5: 67.5 ms ± 775 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
And note that the difference in performance comes from creating the objects to be compared, not from searching through them. For copies = int(1e5):
%%timeit
big_ndf1_hashed = pd.util.hash_pandas_object(big_ndf1)
# 25 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
big_ndf1_tuples = big_ndf1.apply(tuple, axis=1)
# 2.53 s ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The hashed series is also roughly a third of the size of the tuple series on disk (9 MB vs. 33 MB).
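Wrapped up as a small helper, the approach looks roughly like this (a sketch; the index=False argument, which hashes row values only so that identical rows with different index labels still match, is an addition not used in the timings above):
def rows_not_in(df_a, df_b):
    # rows of df_a whose value-hash does not appear anywhere in df_b
    a_hashed = pd.util.hash_pandas_object(df_a, index=False)
    b_hashed = pd.util.hash_pandas_object(df_b, index=False)
    return df_a.loc[~a_hashed.isin(b_hashed)]

rows_not_in(big_sdf1, big_sdf2)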

How to change a column in a matrix made of tuples?

I am unsure of the cost of transforming a matrix of tuples into a list form, which is easier to manipulate. The main priority is being able to change a column of the matrix as fast as possible.
I have a matrix in the form of
[(a,b,c),(d,e,f),(g,h,i)]
which can be any size n x m, but for this example we'll take a 3x3 matrix.
My main goal is to be able to change the values of any column in the matrix, one column at a time (e.g. (b,e,h)).
My initial attempt was to transform the matrix into a list, i.e.
[[a,b,c],[d,e,f],[g,h,i]]
which would be easier to work with,
but I feel it would be costly to transform every tuple into a list and back into a tuple.
My main question is: how can I optimize this as much as possible?
In [37]: def change_column_list_comp(old_m, col, value):
    ...:     return [
    ...:         tuple(list(row[:col]) + [value] + list(row[col + 1:]))
    ...:         for row in old_m
    ...:     ]
    ...:
In [38]: def change_column_list_convert(old_m, col, value):
    ...:     list_m = list(map(list, old_m))
    ...:     for row in list_m:
    ...:         row[col] = value
    ...:
    ...:     return list(map(tuple, list_m))
    ...:
In [39]: m = [tuple('abc'), tuple('def'), tuple('ghi')]
In [40]: %timeit change_column_list_comp(m, 1, 2)
2.05 µs ± 89.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [41]: %timeit change_column_list_convert(m, 1, 2)
1.28 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Looks like converting to a list, modifying the values, and converting back to tuple is faster. Note that this may not be the most efficient way of writing these functions.
However, these functions seem to start to converge as we scale up our matrix.
In [6]: m_100k = [tuple(string.printable)] * 100_000
In [7]: %timeit change_column_list_comp(m_100k, 1, 2)
163 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit change_column_list_convert(m_100k, 1, 2)
117 ms ± 5.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: m_1m = [tuple(string.printable)] * 1_000_000
In [43]: %timeit change_column_list_comp(m_1m, 1, 2)
1.72 s ± 74.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [44]: %timeit change_column_list_convert(m_1m, 1, 2)
1.24 s ± 84.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
At the end of the day, you should use the right tool for the job. While it's not really what the OP asked for, it's worth mentioning that numpy is simply the better way to go.
In [13]: m_np = np.array([list('abc'), list('def'), list('ghi')])
In [17]: %timeit m_np[:, 1] = 2; m_np
610 ns ± 48.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [20]: m_np_100k = np.array([[string.printable] * 100_000])
In [21]: %timeit m_np_100k[:, 1] = 2; m_np_100k
545 ns ± 63.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [22]: m_np_1m = np.array([[string.printable] * 1_000_000])
# This might be using cached data
In [23]: %timeit m_np_1m[:, 1] = 2; m_np_1m
515 ns ± 31.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# Avoiding cache
In [24]: %timeit m_np_1m[:, 4] = 9; m_np_1m
557 ns ± 37.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
This might not be the fairest comparison as we're manually returning the matrix, but you can see there is significant improvement.
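If the surrounding code must keep the list-of-tuples format, a rough sketch of the numpy round trip (an assumption about the workflow, not part of the original answer):
import numpy as np

m = [tuple('abc'), tuple('def'), tuple('ghi')]
arr = np.array(m)                              # shape (3, 3), dtype '<U1'
arr[:, 1] = 'x'                                # overwrite the whole second column in one step
m_back = [tuple(row) for row in arr.tolist()]  # back to a list of tuples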

Computing a slightly different matrix multiplication

I'm trying to find the best way to compute the minimum element-wise products between two sets of vectors. The usual matrix multiplication C = A @ B computes C[i, j] as the sum of the pairwise products of the elements of row i of A and column j of B. I would like to take the minimum of those pairwise products instead of their sum. I can't find an efficient way to do this between two matrices with numpy.
One way to achieve this would be to build the 3D array of the pairwise products between A and B (before the sum) and then take the minimum over the third dimension. But this would lead to a huge memory footprint (and I actually don't know how to do this).
Do you have any idea how I could achieve this operation?
Example:
A = [[1,1],[1,1]]
B = [[0,2],[2,1]]
Matrix matmul:
C = [[1*0+1*2, 1*2+1*1], [1*0+1*2, 1*2+1*1]] = [[2,3],[2,3]]
Minimum matmul:
C = [[min(1*0,1*2), min(1*2,1*1)], [min(1*0,1*2), min(1*2,1*1)]] = [[0,1],[0,1]]
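For reference, here is a plain-Python sketch of the operation, written the way the answers below compute it (pairing row i of A with row j of B, which coincides with the column version for the symmetric B in this example):
def min_matmul(A, B):
    # C[i][j] = min over t of A[i][t] * B[j][t]
    n, k, m = len(A), len(A[0]), len(B)
    return [[min(A[i][t] * B[j][t] for t in range(k)) for j in range(m)]
            for i in range(n)]

min_matmul([[1, 1], [1, 1]], [[0, 2], [2, 1]])   # -> [[0, 1], [0, 1]]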
Use broadcasting after extending A to 3D -
A = np.asarray(A)
B = np.asarray(B)
C_out = np.min(A[:,None]*B,axis=2)
If you care about memory footprint, use the numexpr module to be efficient about it -
import numexpr as ne
C_out = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
Timings on large arrays -
In [12]: A = np.random.rand(200,200)
In [13]: B = np.random.rand(200,200)
In [14]: %timeit np.min(A[:,None]*B,axis=2)
34.4 ms ± 614 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
29.3 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: A = np.random.rand(300,300)
In [17]: B = np.random.rand(300,300)
In [18]: %timeit np.min(A[:,None]*B,axis=2)
113 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
102 ms ± 691 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So there's some improvement with numexpr, but maybe not as much as I was expecting.
Numba can also be an option
I was a bit surprised by the not particularly good numexpr timings, so I tried a Numba version. For large arrays this can be optimized further (much the same principles as for a dgemm apply).
import numpy as np
import numba as nb
import numexpr as ne

@nb.njit(fastmath=True, parallel=True)
def min_pairwise_prod(A, B):
    assert A.shape[1] == B.shape[1]
    res = np.empty((A.shape[0], B.shape[0]))
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            min_prod = A[i, 0] * B[j, 0]
            for k in range(B.shape[1]):
                prod = A[i, k] * B[j, k]
                if prod < min_prod:
                    min_prod = prod
            res[i, j] = min_prod
    return res
Timings
A=np.random.rand(300,300)
B=np.random.rand(300,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
5.56 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
26 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
87.7 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
110 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A=np.random.rand(1000,300)
B=np.random.rand(1000,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
50.6 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
296 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
992 ms ± 7.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
1.27 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
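If Numba or numexpr aren't available, a middle ground (a sketch, not from the original answers) is to broadcast in row blocks, so the temporary array stays small while each block is still fully vectorized:
def min_pairwise_prod_blocked(A, B, block=64):
    # same result as np.min(A[:, None] * B, axis=2), but the temporary
    # (block, B.shape[0], k) array is bounded by the block size
    out = np.empty((A.shape[0], B.shape[0]))
    for start in range(0, A.shape[0], block):
        stop = min(start + block, A.shape[0])
        out[start:stop] = np.min(A[start:stop, None] * B, axis=2)
    return out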

Is there a faster (numpy?) way to combine pandas df int columns into dot-separated str col without TypeError

I want to combine two int columns to create a new dot-separated str column. I've got one way that works but if there is a faster way, it would help. I've also tried a suggestion I found in another answer on SO that produces an error.
This works:
df3 = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
                    'task_number': [38544, 0, 1]})
df3['filename'] = df3['job_number'].astype(str) + '.' + df3['task_number'].astype(str)

0    3913291.38544
1        3887250.0
2        3913041.1
This answer to a similar question suggests a "numpy" way, using .values.astype(str), but I haven't gotten it to work yet. Here I run it without including the dot separator:
df3['job_number'].values.astype(int).astype(str) + df3['task_number'].astype(int).astype(str)
0 391329138544
1 38872500
2 39130411
But when I include the dot separator I get an error:
df3['job_number'].values.astype(int).astype(str) + '.' + df3['task_number'].astype(int).astype(str)
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U11') dtype('<U11') dtype('<U11')
The result I want is:
0 3913291.38544
1 3887250.0
2 3913041.1
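For what it's worth, the TypeError above comes from asking NumPy's add ufunc to concatenate a fixed-width unicode array with a plain Python str, which it did not support at the time. One possible workaround, not among the timed answers below, is to let NumPy concatenate explicitly with np.char.add; a sketch:
import numpy as np
jobs = df3['job_number'].values.astype(str)
tasks = df3['task_number'].values.astype(str)
df3['filename'] = np.char.add(np.char.add(jobs, '.'), tasks)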
For a comparison of the methods given here with the other available methods, refer to jezrael's answer below.
Method 1
Add a dummy column containing '.', use it in the concatenation, and drop it afterwards:
%%timeit
df3['dummy'] ='.'
res = df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
df3.drop(columns=['dummy'], inplace=True)
1.31 ms ± 41.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As an extension of Method 1, if you exclude the time to create and drop the dummy column, this is the best you can get:
%%timeit
df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
286 µs ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 2
Use apply
%timeit df3.T.apply(lambda x: str(x[0]) + '.' + str(x[1]))
883 µs ± 22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can use a list comprehension:
df3["filename"] = ['.'.join(i) for i in
                   zip(df3["job_number"].map(str), df3["task_number"].map(str))]
If you use Python 3.6+, the fastest solution is with f-strings:
df3["filename2"] = [f'{i}.{j}' for i,j in zip(df3["job_number"],df3["task_number"])]
Performance on 30k rows:
df3 = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
                    'task_number': [38544, 0, 1]})
df3 = pd.concat([df3] * 10000, ignore_index=True)
In [64]: %%timeit
...: df3["filename2"] = [f'{i}.{j}' for i,j in zip(df3["job_number"],df3["task_number"])]
...:
20.5 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [65]: %%timeit
...: df3["filename3"] = ['.'.join(i) for i in zip(df3["job_number"].map(str),df3["task_number"].map(str))]
...:
30.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [66]: %%timeit
...: df3["filename4"] = df3.T.apply(lambda x: str(x[0]) + '.' + str(x[1]))
...:
1.7 s ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [67]: %%timeit
...: df3['dummy'] ='.'
...: res = df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
...: df3.drop(columns=['dummy'], inplace=True)
...:
73.6 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The original solution is also very fast:
In [73]: %%timeit
...: df3['filename'] = df3['job_number'].astype(str) + '.' + df3['task_number'].astype(str)
48.3 ms ± 872 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
With a small modification, using map instead of astype:
In [76]: %%timeit
...: df3['filename'] = df3['job_number'].map(str) + '.' + df3['task_number'].map(str)
...:
26 ms ± 676 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Methods in order of %%timeit results
I timed all the suggested methods and a few more on two DataFrames. Here are the timed results for the suggested methods (thank you @meW and @jezrael). If I missed any or you have another, let me know and I'll add it.
Two timings are shown for each method: first for processing the 3 rows in the example df and then for processing 57K rows in another df. Timings may vary on another system. Solutions that include TEST['dot'] in the concatenation string require this column in the df: add it with TEST['dot'] = '.'.
Original method (still the fastest):
.astype(str), +, '.'
%%timeit
TEST['filename'] = TEST['job_number'].astype(str) + '.' + TEST['task_number'].astype(str)
# 553 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 69.6 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
Proposed methods and a few permutations on them:
.astype(int).astype(str), +, '.'
%%timeit
TEST['filename'] = TEST['job_number'].astype(int).astype(str) + '.' + TEST['task_number'].astype(int).astype(str)
# 553 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 70.2 ms ± 739 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
.values.astype(int).astype(str), +, TEST['dot']
%%timeit
TEST['filename'] = TEST['job_number'].values.astype(int).astype(str) + TEST['dot'] + TEST['task_number'].values.astype(int).astype(str)
# 221 µs ± 5.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 82.3 ms ± 743 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
.values.astype(str), +, TEST['dot']
%%timeit
TEST["filename"] = TEST['job_number'].values.astype(str) + TEST['dot'] + TEST['task_number'].values.astype(str)
# 221 µs ± 5.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 92.8 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
'.'.join(), list comprehension, .values.astype(str)
%%timeit
TEST["filename"] = ['.'.join(i) for i in TEST[["job_number",'task_number']].values.astype(str)]
# 743 µs ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 147 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
f-string, list comprehension, .values.astype(str)
%%timeit
TEST["filename2"] = [f'{i}.{j}' for i,j in TEST[["job_number",'task_number']].values.astype(str)]
# 642 µs ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 167 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows
'.'.join(), zip, list comprehension, .map(str)
%%timeit
TEST["filename"] = ['.'.join(i) for i in
zip(TEST["job_number"].map(str), TEST["task_number"].map(str))]
# 512 µs ± 5.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 181 ms ± 4.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows
apply(lambda, str(x[2]), +, '.')
%%timeit
TEST['filename'] = TEST.T.apply(lambda x: str(x[2]) + '.' + str(x[10]))
# 735 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 2.69 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows
If you see a way to improve on any of these, please let me know and I'll add to the list!
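One more candidate that might be worth adding to the list (not timed here): pandas' own Series.str.cat, which takes the separator directly.
TEST['filename'] = TEST['job_number'].astype(str).str.cat(TEST['task_number'].astype(str), sep='.')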

Why use reset_index(drop=True) when setting the index is much faster?

Why would I use reset_index(drop=True), when the alternative is much faster? I am sure there is something I am missing. (Or my timings are bad somehow...)
import pandas as pd
l = pd.Series(range(int(1e7)))
%timeit l.reset_index(drop=True)
# 35.9 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit l.index = range(int(1e7))
# 13 µs ± 455 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The costly operation when resetting the index is not creating the new index (as you showed, that is super fast) but returning a copy of the Series. If you compare:
%timeit l.reset_index(drop=True)
22.6 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit l.index = range(int(1e7))
14.7 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit l.reset_index(inplace=True, drop=True)
13.7 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
You can see that the inplace operation (where no copy is returned) is roughly as fast as your method. However, performing inplace operations is generally discouraged.
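To make the copy-versus-mutation point concrete, a minimal sketch (the variable names are illustrative):
import pandas as pd

s = pd.Series(range(5), index=[10, 11, 12, 13, 14])

t = s.reset_index(drop=True)   # allocates and returns a new Series; s is unchanged
s.index = range(5)             # rebinds the index on s itself; no data is copied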
