I am unsure as to the cost of transforming a matrix of tuples into a list form which is easier to manipulate. The main priority is being able to change a column of the matrix as fast as possible
I have a matrix in the form of
[(a,b,c),(d,e,f),(g,h,i)]
which can appear in any size of n x m but for this example we'll take 3x3 matrix.
my main goal is to be able to change the values of any column in the matrix (only one at a time) (eg (b,e,h)).
my initial attempt was to transform the matrix into a list ie
[[a,b,c],[d,e,f],[g,h,i]]
which would be easier
but I feel that it would be costly in terms of transforming every tuple into a list and back into a tuple.
My main question could be how to optimize this to its fullest?
In [37]: def change_column_list_comp(old_m, col, value):
...: return [
...: tuple(list(row[:col]) + [value] + list(row[col + 1:]))
...: for row in old_m
...: ]
...:
In [38]: def change_column_list_convert(old_m, col, value):
...: list_m = list(map(list, old_m))
...: for row in list_m:
...: row[col] = value
...:
...: return list(map(tuple, list_m))
...:
In [39]: m = [tuple('abc'), tuple('def'), tuple('ghi')]
In [40]: %timeit change_column_list_comp(m, 1, 2)
2.05 µs ± 89.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [41]: %timeit change_column_list_convert(m, 1, 2)
1.28 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Looks like converting to a list, modifying the values, and converting back to tuple is faster. Note that this may not be the most efficient way of writing these functions.
However, these functions seem to start to converge as we scale up our matrix.
In [6]: m_100k = [tuple(string.printable)] * 100_000
In [7]: %timeit change_column_list_comp(m_100k, 1, 2)
163 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit change_column_list_convert(m_100k, 1, 2)
117 ms ± 5.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: m_1m = [tuple(string.printable)] * 1_000_000
In [43]: %timeit change_column_list_comp(m_1m, 1, 2)
1.72 s ± 74.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [44]: %timeit change_column_list_convert(m_1m, 1, 2)
1.24 s ± 84.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
At the end of the day you should be using the right tools for the job. While it's not really in the OP, it's just worth mentioning that numpy is simply the better way to go.
In [13]: m_np = np.array([list('abc'), list('def'), list('ghi')])
In [17]: %timeit m_np[:, 1] = 2; m_np
610 ns ± 48.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [20]: m_np_100k = np.array([[string.printable] * 100_000])
In [21]: %timeit m_np_100k[:, 1] = 2; m_np_100k
545 ns ± 63.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [22]: m_np_1m = np.array([[string.printable] * 1_000_000])
# This might be using cached data
In [23]: %timeit m_np_1m[:, 1] = 2; m_np_1m
515 ns ± 31.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# Avoiding cache
In [24]: %timeit m_np_1m[:, 4] = 9; m_np_1m
557 ns ± 37.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
This might not be the fairest comparison as we're manually returning the matrix, but you can see there is significant improvement.
Related
I have made a script where I check if a column's values from dataframe A exist in a columns of dataframe B. Here, dataframe A is named whole data, and dataframe B is named referrarls
users=set(whole_data['user_id'])
referees=set(referrals['referee_id'])
non_referees=set([x for x in users if x not in referees])
As you can see, I want a list of users (named non_referees) that contain users that are not referees, that's why I am checking for every user_id from whole_data if it exists in the set of referees.
Nonetheless, this is taking a massive amount of time, there are like 100K users and 4K referees. Is there a way to make this faster?
First, pandas can already give you the unique values of a series, which might be faster than building the set from the whole column.
Second, to build the set of non-referees, you can then use set operations:
non_referees = users - referees
EDIT: As an additional note, if you build a set using the generator expression style, you don't need to build an intermediate list:
# slow because it first builds a list and then turns that into a set:
some_set = set([x for x in something])
# faster because it goes right into building the set:
some_other_set = set(x for x in something_else)
You may want to consider:
non_referees = set(whole_data['user_id'].unique()).difference(
referrals['referee_id'].unique()
)
If you think there are few repeats in referrals['referee_id'], then you'll gain a smidgen of speed by avoiding .unique() for them.
Speed
Here are some experiments with a few closely related forms:
Case with lots of duplicated referee_id
n = 100_000
whole_data = pd.DataFrame({
'user_id': np.random.randint(0, n, n),
})
referrals = pd.DataFrame({
'referee_id': np.random.randint(0, n, n),
})
Measurements:
%timeit non_referees = set(pd.unique(whole_data['user_id'])) - set(pd.unique(referrals['referee_id']))
# 23.4 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()) - set(referrals['referee_id'].unique())
# 23.3 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(set(referrals['referee_id'].unique()))
# 23.3 ms ± 73.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# *** fastest
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'].unique())
# 21.4 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'])
# 29.6 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Case with few duplicates in referee_id:
n = 100_000
whole_data = pd.DataFrame({
'user_id': np.random.randint(0, n, n),
})
referrals = pd.DataFrame({
'referee_id': np.random.randint(0, 100*n, n),
})
Measurements:
%timeit non_referees = set(pd.unique(whole_data['user_id'])) - set(pd.unique(referrals['referee_id']))
# 30.7 ms ± 61.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()) - set(referrals['referee_id'].unique())
# 30.7 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(set(referrals['referee_id'].unique()))
# 30.6 ms ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'].unique())
# 23.7 ms ± 37.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# *** fastest
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'])
# 20.9 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have a calculation in my code that get carried out thousands of times and I wanted to see if I could make it faster as it is currently using two nested loops. I assumed that if I used broadcasting I could make it several times faster.
I've shown the two options below, which thankfully give the same results.
import numpy as np
n = 1000
x = np.random.random([n, 3])
y = np.random.random([n, 3])
func_weight = np.random.random(n)
result = np.zeros([n, 9])
result_2 = np.zeros([n, 9])
# existing
for a in range(3):
for b in range(3):
result[:, 3*a + b] = x[:, a] * y[:, b] * func_weight
# broadcasting - assumed this would be faster
for a in range(3):
result_2[:, 3*a:3*(a+1)] = np.expand_dims(x[:, a], axis=-1) * y * np.expand_dims(func_weight, axis=-1)
Timings
n=100
nested loops: 24.7 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
broadcasting: 70.3 µs ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
n=1000
nested loops: 50.5 µs ± 913 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
broadcasting: 148 µs ± 372 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
n=10000
nested loops: 327 µs ± 7.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
broadcasting: 864 µs ± 5.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In my testing, broadcasting is always slower, so I'm a little confused as to what is happening. I'm guessing that because I had to use expand_dims to get the shapes aligned in the second solution, that is what the big impact on performance is. Is that correct? As the array size grows, there's not much change in performance with the nested loop always about 3 times quicker.
Is there a more optimal third solution that I haven't considered?
In [126]: %%timeit
...: result = np.zeros([n,9])
...: for a in range(3):
...: for b in range(3):
...: result[:, 3*a + b] = x[:, a] * y[:, b] * func_weight
141 µs ± 255 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [128]: %%timeit
...: result_2 = np.zeros([n,9])
...: for a in range(3):
...: result_2[:, 3*a:3*(a+1)] = np.expand_dims(x[:, a], axis=-1) * y * n
...: p.expand_dims(func_weight, axis=-1)
202 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
A fully broadcasted version:
In [130]: %%timeit
...: result_3 = (x[:,:,None]*y[:,None,:]*func_weight[:,None,None]).reshape(
...: n,9)
88.8 µs ± 73.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Replacing the expand_dims with np.newaxis/None expansion:
In [131]: %%timeit
...: result_2 = np.zeros([n,9])
...: for a in range(3):
...: result_2[:, 3*a:3*(a+1)] = x[:, a,None] * y * func_weight[:,None]
132 µs ± 315 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So yes, expand_dims is a bit slow, I think because it tries to be general purpose. And an extra layer of function calls.
expand_dims is just a.reshape(shape), but it takes a bit of time to translate your axis parameter into the shape tuple. As an experienced user I find that the None syntax is clearer (and faster) - visually it stands out as a dimension-adding action.
I'm trying to find the best way to compute the minimum element wise products between two sets of vectors. The usual matrix multiplication C=A#B computes Cij as the sum of the pairwise products of the elements of the vectors Ai and B^Tj. I would like to perform instead the minimum of the pairwise products. I can't find an efficient way to do this between two matrices with numpy.
One way to achieve this would be to generate the 3D matrix of the pairwise products between A and B (before the sum) and then take the minimum over the third dimension. But this would lead to a huge memory footprint (and I actually dn't know how to do this).
Do you have any idea how I could achieve this operation ?
Example:
A = [[1,1],[1,1]]
B = [[0,2],[2,1]]
matrix matmul:
C = [[1*0+1*2,1*2+1*1][1*0+1*2,1*2+1*1]] = [[2,3],[2,3]]
minimum matmul:
C = [[min(1*0,1*2),min(1*2,1*1)][min(1*0,1*2),min(1*2,1*1)]] = [[0,1],[0,1]]
Use broadcasting after extending A to 3D -
A = np.asarray(A)
B = np.asarray(B)
C_out = np.min(A[:,None]*B,axis=2)
If you care about memory footprint, use numexpr module to be efficient about it -
import numexpr as ne
C_out = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
Timings on large arrays -
In [12]: A = np.random.rand(200,200)
In [13]: B = np.random.rand(200,200)
In [14]: %timeit np.min(A[:,None]*B,axis=2)
34.4 ms ± 614 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
29.3 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: A = np.random.rand(300,300)
In [17]: B = np.random.rand(300,300)
In [18]: %timeit np.min(A[:,None]*B,axis=2)
113 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
102 ms ± 691 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, there's some improvement with numexpr, but maybe not as much I was expecting it to be.
Numba can be also an option
I was a bit surprised of the not particularly good Numexpr Timings, so I tried a Numba Version. For large Arrays this can be optimized further. (Quite the same principles like for a dgemm can be applied)
import numpy as np
import numba as nb
import numexpr as ne
#nb.njit(fastmath=True,parallel=True)
def min_pairwise_prod(A,B):
assert A.shape[1]==B.shape[1]
res=np.empty((A.shape[0],B.shape[0]))
for i in nb.prange(A.shape[0]):
for j in range(B.shape[0]):
min_prod=A[i,0]*B[j,0]
for k in range(B.shape[1]):
prod=A[i,k]*B[j,k]
if prod<min_prod:
min_prod=prod
res[i,j]=min_prod
return res
Timings
A=np.random.rand(300,300)
B=np.random.rand(300,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
5.56 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
26 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
87.7 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
110 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A=np.random.rand(1000,300)
B=np.random.rand(1000,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
50.6 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
296 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
992 ms ± 7.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
1.27 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I want to combine two int columns to create a new dot-separated str column. I've got one way that works but if there is a faster way, it would help. I've also tried a suggestion I found in another answer on SO that produces an error.
This works:
df3 = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
'task_number': [38544, 0, 1]})
df3['filename'] = df3['job_number'].astype(str) + '.' + df3['task_number'].astype(str)
0 3913291.38544
1 3887250.0
2 3913041.1
This answer to a similar question suggests a "numpy" way, using .values.astype(str), but I haven't gotten it to work yet. Here I run it without including the dot separator:
df3['job_number'].values.astype(int).astype(str) + df3['task_number'].astype(int).astype(str)
0 391329138544
1 38872500
2 39130411
But when I include the dot separator I get an error:
df3['job_number'].values.astype(int).astype(str) + '.' + df3['task_number'].astype(int).astype(str)
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U11') dtype('<U11') dtype('<U11')
The result I want is:
0 3913291.38544
1 3887250.0
2 3913041.1
For comparison of given methods with other available methods do refer #Jezrael answer.
Method 1
To add a dummy column containing ., use it in processing and later drop it:
%%timeit
df3['dummy'] ='.'
res = df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
df3.drop(columns=['dummy'], inplace=True)
1.31 ms ± 41.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
To the extension of method 1, if you exclude the processing time of dummy column creation and dropping it then it is the best you get -
%%timeit
df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
286 µs ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 2
Use apply
%timeit df3.T.apply(lambda x: str(x[0]) + '.' + str(x[1]))
883 µs ± 22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can use list comprehension:
df3["filename"] = ['.'.join(i) for i in
zip(df3["job_number"].map(str),df3["task_number"].map(str))]
If use python 3.6+ the fastest solution with f-strings:
df3["filename2"] = [f'{i}.{j}' for i,j in zip(df3["job_number"],df3["task_number"])]
Performance in 30k rows:
df3 = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
'task_number': [38544, 0, 1]})
df3 = pd.concat([df3] * 10000, ignore_index=True)
In [64]: %%timeit
...: df3["filename2"] = [f'{i}.{j}' for i,j in zip(df3["job_number"],df3["task_number"])]
...:
20.5 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [65]: %%timeit
...: df3["filename3"] = ['.'.join(i) for i in zip(df3["job_number"].map(str),df3["task_number"].map(str))]
...:
30.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [66]: %%timeit
...: df3["filename4"] = df3.T.apply(lambda x: str(x[0]) + '.' + str(x[1]))
...:
1.7 s ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [67]: %%timeit
...: df3['dummy'] ='.'
...: res = df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
...: df3.drop(columns=['dummy'], inplace=True)
...:
73.6 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
But also very fast is original solution:
In [73]: %%timeit
...: df3['filename'] = df3['job_number'].astype(str) + '.' + df3['task_number'].astype(str)
48.3 ms ± 872 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
With small modification - using map instead astype:
In [76]: %%timeit
...: df3['filename'] = df3['job_number'].map(str) + '.' + df3['task_number'].map(str)
...:
26 ms ± 676 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Methods in order of %%timeit results
I timed all the suggested methods and a few more on two DataFrames. Here are the timed results for the suggested methods (thank you #meW and #jezrael). If I missed any or you have another, let me know and I'll add it.
Two timings are shown for each method: first for processing the 3 rows in the example df and then for processing 57K rows in another df. Timings may vary on another system. Solutions that include TEST['dot'] in the concatenation string require this column in the df: add it with TEST['dot'] = '.'.
Original method (still the fastest):
.astype(str), +, '.'
%%timeit
TEST['filename'] = TEST['job_number'].astype(str) + '.' + TEST['task_number'].astype(str)
# 553 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 69.6 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
Proposed methods and a few permutations on them:
.astype(int).astype(str), +, '.'
%%timeit
TEST['filename'] = TEST['job_number'].astype(int).astype(str) + '.' + TEST['task_number'].astype(int).astype(str)
# 553 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 70.2 ms ± 739 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
.values.astype(int).astype(str), +, TEST['dot']
%%timeit
TEST['filename'] = TEST['job_number'].values.astype(int).astype(str) + TEST['dot'] + TEST['task_number'].values.astype(int).astype(str)
# 221 µs ± 5.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 82.3 ms ± 743 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
.values.astype(str), +, TEST['dot']
%%timeit
TEST["filename"] = TEST['job_number'].values.astype(str) + TEST['dot'] + TEST['task_number'].values.astype(str)
# 221 µs ± 5.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 92.8 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
'.'.join(), list comprehension, .values.astype(str)
%%timeit
TEST["filename"] = ['.'.join(i) for i in TEST[["job_number",'task_number']].values.astype(str)]
# 743 µs ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 147 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
f-string, list comprehension, .values.astype(str)
%%timeit
TEST["filename2"] = [f'{i}.{j}' for i,j in TEST[["job_number",'task_number']].values.astype(str)]
# 642 µs ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 167 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows
'.'.join(), zip, list comprehension, .map(str)
%%timeit
TEST["filename"] = ['.'.join(i) for i in
zip(TEST["job_number"].map(str), TEST["task_number"].map(str))]
# 512 µs ± 5.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 181 ms ± 4.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows
apply(lambda, str(x[2]), +, '.')
%%timeit
TEST['filename'] = TEST.T.apply(lambda x: str(x[2]) + '.' + str(x[10]))
# 735 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 2.69 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows
If you see a way to improve on any of these, please let me know and I'll add to the list!
I would like to understand why Dataframe attributes access time seems so long (often 100 x slower vs other objects). An example :
In [37]: df=pd.DataFrame([])
In [38]: a=np.array([])
In [39]: %timeit df.size
28 µs ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [40]: %timeit a.size
136 ns ± 9.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)