Optimized computation of pairwise correlations in Python

Given a set of discrete locations (e.g. "sites") that are pairwise related in some categorical way (e.g. general proximity) and that each carry local-level data (e.g. population size), I wish to efficiently compute the mean correlation coefficient between the local-level data of all location pairs that share the same relationship.
For example, I assumed 100 sites and randomized their pairwise relations using values 1 to 25, yielding the upper-triangular matrix relations:
import numpy as np
sites = 100
categ = 25
relations = np.random.randint(low=1, high=categ+1, size=(sites, sites))
relations = np.triu(relations) # keep only the upper triangle, so each unordered pair is stored once (relation_ij = relation_ji)
np.fill_diagonal(relations, 0) # ignore self-relation
I also have 5000 replicates of simulation results on each site:
sims = 5000
res = np.round(np.random.rand(sites, sims),1)
To compute the mean pairwise correlation for each relation category, I first calculate, for each category i, the correlation coefficient rho[j] between the simulation results res of each unique site pair j, and then take the average across all pairs with relation i:
rho_list = np.ones(categ)*99
for i in range(1, categ+1):
    idr = np.transpose(np.where(relations == i)) # pairwise site indices of the same relation category
    comp = np.vstack([res[idr[:,0]].ravel(), res[idr[:,1]].ravel()]) # pairwise comparisons of simulation results from the same relation category
    comp_uniq = np.reshape(comp.T, (len(idr), res.shape[1], -1)) # reshape above into pairwise comparisons of simulation results between unique site pairs
    rho = np.ones(len(idr))*99 # correlation coefficients of all unique site pairs of current relation category
    for j in range(len(idr)): # loop through unique site pairs
        comp_uniq_s = comp_uniq[j][np.all(comp_uniq!=0, axis=2)[j]].T # shorten comparisons by removing pairs with zero-valued result
        rho[j] = np.corrcoef(comp_uniq_s[0], comp_uniq_s[1])[0,1]
    rho_list[i-1] = np.nanmean(rho)
Although this script works, once I increase sites = 400 the entire computation can take more than 6 hours to finish, which leads me to question my use of array functions. What is the reason for this poor performance, and how can I optimize the algorithm?

We can vectorize the innermost loop (the one over j) with some masking to take care of the ragged nature of the data processed at each of its iterations. We can also replace the slowish np.corrcoef with an explicit, mask-aware Pearson correlation (inspired by this post). Additionally, we can optimize a few steps at the start of the outer loop, especially the stacking steps, which could be bottlenecks.
Thus, the complete code would reduce to something like this -
for i in range(1, categ+1):
    r, c = np.where(relations == i)
    A = res[r]   # fancy indexing copies, so res itself is never modified
    B = res[c]
    # Mask positions where either site has a zero-valued result
    mask0 = ~((A != 0) & (B != 0))
    A[mask0] = 0
    B[mask0] = 0
    # Per-pair count of valid (unmasked) simulation results
    count = mask0.shape[-1] - mask0.sum(-1, keepdims=1)
    # Subtract per-pair means computed over the valid entries only
    A_mA = A - A.sum(-1, keepdims=1)/count
    B_mB = B - B.sum(-1, keepdims=1)/count
    A_mA[mask0] = 0
    B_mB[mask0] = 0
    # Pearson correlation from sums of squares and cross-products
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = np.einsum('ij,ij->i', B_mB, B_mB)
    rho = np.einsum('ij,ij->i', A_mA, B_mB)/np.sqrt(ssA*ssB)
    rho_list[i-1] = np.nanmean(rho)
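As a quick sanity check on the corrcoef replacement (a minimal sketch, reusing the sims variable from the question): with nothing masked, the explicit Pearson formula matches np.corrcoef for a single pair of result vectors.
a = np.random.rand(sims) + 0.1   # strictly positive, so no zeros to mask
b = np.random.rand(sims) + 0.1
a_m = a - a.mean()
b_m = b - b.mean()
rho_manual = np.einsum('i,i->', a_m, b_m) / np.sqrt(
    np.einsum('i,i->', a_m, a_m) * np.einsum('i,i->', b_m, b_m))
print(np.isclose(rho_manual, np.corrcoef(a, b)[0, 1]))   # expected: True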
Runtime tests
Case # 1: On given sample data, with sites = 100
In [381]: %timeit loopy_app()
1 loop, best of 3: 7.45 s per loop
In [382]: %timeit vectorized_app()
1 loop, best of 3: 479 ms per loop
15x+ speedup.
Case # 2: With sites = 200
In [387]: %timeit loopy_app()
1 loop, best of 3: 1min 56s per loop
In [388]: %timeit vectorized_app()
1 loop, best of 3: 1.86 s per loop
In [390]: 116/1.86
Out[390]: 62.36559139784946
62x+ speedup.
Case # 3: Finally with sites = 400
In [392]: %timeit vectorized_app()
1 loop, best of 3: 7.64 s per loop
This took 6hrs+ at OP's end with their loopy method.
From the timings, it became clear that vectorizing the inner loop was the key to getting noticeable speedups for large sites.

What is the best efficient way to loop through 2d array in Python

I am new to Python and machine learning and I can't find the best way to do this on the internet. I have a big 2D array (distance_matrix.shape = (47, 1328624)). I wrote the code below, but it takes too long to run; the nested for loops are the bottleneck.
import pandas as pd

distance_matrix = [[0.21218192, 0.12845819, 0.54545613, 0.92464129, 0.12051526, 0.0870853 ],
                   [0.2168166 , 0.11174682, 0.58193855, 0.93949729, 0.08060061, 0.11963891],
                   [0.23996999, 0.17554854, 0.60833433, 0.93914766, 0.11631545, 0.2036373 ]]

iskeleler = pd.DataFrame({
    'lat': [40.992752, 41.083202, 41.173462],
    'lon': [29.023165, 29.066652, 29.088163],
    'name': ['Kadıköy', 'AnadoluHisarı', 'AnadoluKavağı']
}, dtype=str)

for i in range(len(distance_matrix)):
    for j in range(len(distance_matrix[0])):
        if distance_matrix[i][j] < 1:
            iskeleler.loc[i, 'Address'] = distance_matrix[i][j]

print(iskeleler)
To explain, I am sharing the first 5 rows of my array and showing my dataframe.
[screenshot: İskeleler dataframe]
[screenshot: distance_matrix]
The "İskeleler" dataframe has 47 rows. I want to add them to the 'Address' column in row i in the "İskeleler" by looking at all the values in row i in the distance_matrix and adding the ones less than 1. I mean if we look at the first row in the distance_matrix photo, I want to add the numbers like 0.21218192 + 0.12845819 + 0.54545613 .... and put them in the 'address' column in the i'th row in the İskeleler dataframe.
My intend is to loop through distance_matrix and find some values which smaller than 1. The code takes too long. How can i do this with faster way?
I think you mean this:
import numpy as np
# Set up some dummy data in range 0..100
distance = np.random.rand(47,1328624) * 100.0
# Boolean mask of all values < 1
mLessThan1 = distance<1
# Sum elements <1 across rows
result = np.sum(distance*mLessThan1, axis=1)
That takes 168ms on my Mac.
In [47]: %timeit res = np.sum(distance*mLessThan1, axis=1)
168 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
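If the goal is to store these row sums in the 'Address' column described in the question, the result can be assigned back directly (a small sketch, assuming iskeleler has one row per row of the distance matrix, i.e. 47 each):
iskeleler['Address'] = result   # one sum per row, aligned with the 47 rows of iskeleler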

Pandas - Explanation on apply function being slow

The apply function seems to be very slow with a large dataframe (about 1-3 million rows).
I have checked related questions here, like Speed up Pandas apply function and Counting within pandas apply() function, and it seems the best way to speed it up is not to use the apply function :)
For my case, I have two kinds of tasks to do with the apply function.
First: apply with lookup dict query
def f(p_id, p_dict):
    return p_dict[p_dict['ID'] == p_id]['value']

p_dict = DataFrame(...)  # another DataFrame that works like a lookup table
df = df.apply(f, args=(p_dict,))
Second: apply with groupby
from functools import partial

def f(week_id, min_week_num, p_dict):
    return p_dict[(week_id - min_week_num < p_dict['WEEK']) & (p_dict['WEEK'] < week_id)].ix[:, 2].mean()

f_partial = partial(f, min_week_num=min_week_num, p_dict=p_dict)
df = map(f_partial, df['WEEK'])
I guess the first case could be done with a dataframe join, but I am not sure about the resource cost of such a join on a large dataset.
My question is:
Is there any way to substitute apply in the two above cases?
Why is apply so slow? For the dict-lookup case, I think it should be O(N); it shouldn't cost that much even if N is 1 million.
Concerning your first question, I can't say exactly why this instance is slow. But generally, apply does not take advantage of vectorization. Also, apply returns a new Series or DataFrame object for every row, so with a very large DataFrame you have considerable object-creation overhead (I cannot guarantee this is the case 100% of the time, since Pandas has a lot of internal implementation optimizations).
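As a small illustration of the vectorization point (a sketch, not tied to the lookup task itself): even a trivial per-element function pays Python-level call overhead on every row, while the equivalent column-wise expression runs in compiled code.
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(10**6))
slow = s.apply(lambda v: v * 2)   # one Python function call per element
fast = s * 2                      # a single vectorized operation, typically orders of magnitude faster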
For your first method, I assume you are trying to fill a 'value' column in df using the p_dict as a lookup table. It is about 1000x faster to use pd.merge:
import string, sys
import numpy as np
import pandas as pd
##
# Part 1 - filling a column by a lookup table
##
def f1(col, p_dict):
    return [p_dict[p_dict['ID'] == s]['value'].values[0] for s in col]
# Testing
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'ID': [s for s in string.ascii_uppercase], 'value': np.random.randint(0,n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
# Apply the f1 method as posted
%timeit -n1 -r5 temp = df.apply(f1, args=(p_dict,))
>>> 1 loops, best of 5: 832 ms per loop
# Using merge
np.random.seed(997)
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
%timeit -n1 -r5 temp = pd.merge(df, p_dict, how='inner', left_on='p_id', right_on='ID', copy=False)
>>> 1000 loops, best of 5: 826 µs per loop
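Note that pd.merge returns a new frame rather than filling df in place. If the end goal is a 'value' column on df itself, one way (a sketch, assuming every p_id has exactly one match in p_dict) is to merge with how='left', which preserves df's row order:
df['value'] = pd.merge(df, p_dict, how='left',
                       left_on='p_id', right_on='ID')['value'].values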
Concerning the second task, we can quickly add a new column to p_dict that holds the mean over a time window starting at min_week_num and ending at the week of that row in p_dict. This requires that p_dict is sorted in ascending order along the WEEK column. Then you can use pd.merge again.
I am assuming that min_week_num is 0 in the following example, but you could easily modify rolling_growing_mean to take a different value. The rolling_growing_mean method runs in O(n), since it performs a fixed number of operations per iteration.
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'WEEK': range(52), 'value': np.random.randint(0, 1000, 52)})
df = pd.DataFrame({'WEEK': np.random.randint(0, 52, n_size)})
def rolling_growing_mean(values):
    out = np.empty(len(values))
    out[0] = values[0]
    # Time window for taking the mean grows each step
    for i, v in enumerate(values[1:]):
        out[i+1] = np.true_divide(out[i]*(i+1) + v, i+2)
    return out
p_dict['Means'] = rolling_growing_mean(p_dict['value'])
df_merged = pd.merge(df, p_dict, how='inner', left_on='WEEK', right_on='WEEK')
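After the merge, each row carries the growing mean for its week; a quick usage sketch:
# df_merged['Means'] holds, for each row, the mean of p_dict['value']
# over weeks 0 through that row's WEEK (inclusive).
print(df_merged[['WEEK', 'Means']].head())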

Groupby in python pandas: Fast Way

I want to improve the time of a groupby in python pandas.
I have this code:
df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
The objective is to count how many contracts a client has in a given month and add this information as a new column (Nbcontrats).
Client: client code
Month: month of data extraction
Contrat: contract number
I want to improve the time. Below I am only working with a subset of my real data:
%timeit df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
1 loops, best of 3: 391 ms per loop
df.shape
Out[309]: (7464, 61)
How can I improve the execution time?
Here's one way to proceed:
Slice out the relevant columns (['Client', 'Month']) from the input dataframe into a NumPy array. This is mostly a performance-focused idea as we would be using NumPy functions later on, which are optimized to work with NumPy arrays.
Convert the data from the two columns ['Client', 'Month'] into a single 1D array that is its linear-index equivalent, treating the elements of the two columns as pairs: the elements from 'Client' act as row indices and the 'Month' elements as column indices. This is like going from 2D to 1D. The issue is deciding the shape of the 2D grid for such a mapping; to cover all pairs, a safe choice is a grid whose dimensions are one more than the maximum along each column (because of 0-based indexing in Python). Thus, we get linear indices.
Next up, we tag each linear index based on its uniqueness among the others; these tags correspond to the keys that groupby would produce. We also need the count of each group/unique key along the entire length of that 1D array. Finally, indexing into the counts with those tags maps the respective count onto each element.
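As a tiny illustration of the linear-index idea (a sketch with made-up values; np.ravel_multi_index expects non-negative integer codes):
import numpy as np

pairs = np.array([[0, 2],    # toy (Client, Month) pairs
                  [1, 2],
                  [0, 2]])
dims = pairs.max(0) + 1                     # grid shape: one more than the max per column -> [2, 3]
lidx = np.ravel_multi_index(pairs.T, dims)
print(lidx)                                 # [2 5 2] -- identical pairs map to identical linear indices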
That's the whole idea! Here's the implementation -
# Save relevant columns as a NumPy array for performing NumPy operations afterwards
arr_slice = df[['Client', 'Month']].values
# Get linear indices equivalent of those columns
lidx = np.ravel_multi_index(arr_slice.T,arr_slice.max(0)+1)
# Get unique IDs corresponding to each linear index (i.e. group) and grouped counts
unq,unqtags,counts = np.unique(lidx,return_inverse=True,return_counts=True)
# Index counts with the unique tags to map across all elements with the counts
df["Nbcontrats"] = counts[unqtags]
Runtime test
1) Define functions:
def original_app(df):
    df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)

def vectorized_app(df):
    arr_slice = df[['Client', 'Month']].values
    lidx = np.ravel_multi_index(arr_slice.T, arr_slice.max(0)+1)
    unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
    df["Nbcontrats"] = counts[unqtags]
2) Verify results:
In [143]: # Let's create a dataframe with 100 unique IDs and of length 10000
...: arr = np.random.randint(0,100,(10000,3))
...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
...: df1 = df.copy()
...:
...: # Run the function on the inputs
...: original_app(df)
...: vectorized_app(df1)
...:
In [144]: np.allclose(df["Nbcontrats"],df1["Nbcontrats"])
Out[144]: True
3) Finally, time them:
In [145]: # Let's create a dataframe with 100 unique IDs and of length 10000
...: arr = np.random.randint(0,100,(10000,3))
...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
...: df1 = df.copy()
...:
In [146]: %timeit original_app(df)
1 loops, best of 3: 645 ms per loop
In [147]: %timeit vectorized_app(df1)
100 loops, best of 3: 2.62 ms per loop
With the DataFrameGroupBy.size method:
df.set_index(['Client', 'Month'], inplace=True)
df['Nbcontrats'] = df.groupby(level=(0,1)).size()
df.reset_index(inplace=True)
The most work goes into assigning the result back into a column of the source DataFrame.

Fill a Pytables array with random values: horizontally vs vertically

So I am using PyTables to store a NumPy array of size (10,000 x 100). My goal is to fill it with random values.
import numpy
import numpy.random as nr
import tables as tb

h5File = '/Users/me/tmp0/test0.h5'
f = tb.openFile(h5File, 'w')                 # PyTables pre-3.0 camelCase API (open_file in newer versions)
atom = tb.Atom.from_dtype(numpy.dtype('float32'))
x = f.createCArray(f.root, 'prices', atom=atom, shape=(10000, 100))
In this example I could simply do x[:] = nr.random((10000, 100)), but in reality my matrix is much bigger, more like (100,000,000 x 500), so I need to do it in chunks. First I tried vertically, column by column:
%%timeit
for k in xrange(100):
    x[:, k] = nr.random(10000)

1 loops, best of 3: 255 ms per loop
Then I tried horizontally:
%%timeit
for k in xrange(0, 10000, 100):
    x[k:k+100, :] = nr.random((100, 100))

100 loops, best of 3: 22.4 ms per loop
Why is the horizontal one 10 times faster? Also, is there a simpler way to do that?
For the speed: it's because of how computers keep memory organized.
Internally, the entire matrix is kept in linear memory. To keep it easy to wrap our heads around, suppose you had a 2x2 matrix:
1 2
3 4
Internally, that would be stored as
memAddr1: 1
memAddr2: 2
memAddr3: 3
memAddr4: 4
So if you write it row by row, you make very efficient use of consecutive memory addresses (1-4). If you write it column by column, you force frequent strided jumps (address 1, then 3, then 2, then 4).
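A quick way to see this layout from NumPy itself (a small illustrative sketch):
import numpy as np

a = np.array([[1, 2], [3, 4]])   # C (row-major) order, the NumPy default
print(a.ravel(order='K'))        # [1 2 3 4] -- the actual in-memory order
print(a.strides)                 # e.g. (16, 8) for int64: a step along a row moves 8 bytes,
                                 # a step down a column jumps 16 bytes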
The reason has been explained already: how you store data in memory strongly affects the performance you get. To learn more about the issue, look at slide 19 (and its neighbors) of this presentation:
http://www.pytables.org/docs/StarvingCPUs-PyTablesUsages.pdf
