Improve efficiency of column-wise cosine similarity (Python/NumPy)

I am fairly new to programming and have never used NumPy before.
I have a 19001 x 19001 matrix. It contains a lot of zeros, so it is relatively sparse. I wrote some code to compute the pairwise cosine similarity of the columns whose entries in a given row are non-zero. For each row I add up these pairwise similarity values and apply some mathematical operations to them to obtain a single value per row in the end (see code below). It does what it is supposed to do, but with this many dimensions it is really slow. Is there any way to modify my code to make it more efficient?
import numpy as np
from scipy.spatial.distance import cosine

row_number = 0
out_file = open('outfile.txt', 'w')

for row in my_matrix:
    non_zeros = np.nonzero(my_matrix[row_number])[0]
    non_zeros = list(non_zeros)
    cosine_sim = []
    for item in non_zeros:
        if len(non_zeros) <= 1:
            break
        x = non_zeros[0]
        y = non_zeros[1]
        similarity = 1 - cosine(my_matrix[:, x], my_matrix[:, y])
        cosine_sim.append(similarity)
        non_zeros.pop(0)
    summing = np.sum(cosine_sim)
    mean = summing / len(cosine_sim)
    log = np.log(mean)
    out_file_value = log * -1
    out_file.write(str(row_number) + " " + str(out_file_value) + "\n")
    if row_number <= 19000:
        row_number += 1
    else:
        break
I know there are functions to compute the cosine similarity directly between columns (from sklearn.metrics.pairwise import cosine_similarity), so I tried that. However, the output is more or less the same and at the same time really confusing to me, even though I read the documentation and the posts on this page referring to the issue.
For instance:
my_matrix =[[0. 0. 7. 0. 5.]
[0. 0. 11. 0. 0.]
[0. 2. 0. 0. 0.]
[0. 0. 2. 11. 5.]
[0. 0. 5. 0. 0.]]
transposed = np.transpose(my_matrix)
sim_matrix = cosine_similarity(transposed)
# resulting similarity matrix
sim_matrix =[[0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0.14177624 0.45112924]
[0. 0. 0.14177624 1. 0.70710678]
[0. 0. 0.45112924 0.70710678 1.]]
If I compute the cosine similarity with my code above, it returns 0.45112924 for the 1st row ([0]), and 0.14177624 and 0.70710678 for the 4th row ([3]).
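Checking one entry by hand: columns 2 and 4 of my_matrix are [7, 11, 0, 2, 5] and [5, 0, 0, 5, 0], so
cos(c2, c4) = (7*5 + 2*5) / (sqrt(199) * sqrt(50)) = 45 / sqrt(9950) ≈ 0.45112924,
which matches sim_matrix[2][4] and the value my code writes for row 0.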
outfile.txt
0 0.796001425306
1 nan
2 nan
3 0.856981065776
4 nan
I greatly appreciate any help or suggestions to my question!

You can consider using scipy instead. Note, however, that scipy's cdist does not take a sparse matrix as input; you have to provide a dense NumPy array.
import numpy as np
from scipy.spatial.distance import cdist

X = np.random.randn(10000, 10000)
D = cdist(X.T, X.T, metric='cosine')  # cosine distance between every pair of columns
Here is the speed I got for a 10000 x 10000 random array.
%timeit cdist(X.T, X.T, metric='cosine')
16.4 s ± 325 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Try it on a small array:
X = np.array([[1, 0, 1], [0, 3, 2], [1, 0, 1]])
D = cdist(X.T, X.T, metric='cosine')
This gives (values rounded; the diagonal may show tiny floating-point noise instead of exact zeros):
[[ 0.          1.          0.42264973]
 [ 1.          0.          0.18350342]
 [ 0.42264973  0.18350342  0.        ]]
For example, D[0, 2] is the cosine distance between columns 0 and 2:
from numpy.linalg import norm
1 - np.dot(X[:, 0], X[:, 2]) / (norm(X[:, 0]) * norm(X[:, 2]))  # gives 0.422649...
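For the per-row quantity computed in the question (minus the log of the mean similarity over the column pairs formed in the inner loop), the column similarities only need to be computed once. A minimal sketch of that idea, assuming my_matrix is a NumPy array (sklearn's cosine_similarity also accepts scipy sparse input) and reproducing the consecutive-pair selection and -log(mean) post-processing of the original loop:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# all column-to-column similarities in one call
sim = cosine_similarity(my_matrix.T)                  # shape (n_cols, n_cols)

with open('outfile.txt', 'w') as out_file:
    for row_number, row in enumerate(my_matrix):
        non_zeros = np.nonzero(row)[0]
        if len(non_zeros) <= 1:
            out_file.write("%d nan\n" % row_number)   # same as the original's 0/0 case
            continue
        # the pairs the original loop forms: consecutive non-zero columns
        pair_sims = sim[non_zeros[:-1], non_zeros[1:]]
        out_file.write("%d %s\n" % (row_number, -np.log(pair_sims.mean())))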

Related

Is it possible to find similarities between rows in a matrix without a loop?

I have a 2D NumPy array. I'm trying to compute the similarities between rows and put them into a similarities array. Is this possible without a loop? Thanks for your time!
# ratings.shape = (943, 1682)
arri = np.zeros(943)
arri = np.where(arri == 0)[0]
arrj = np.zeros(943)
arrj = np.where(arrj ==0)[0]
similarities = np.zeros((ratings.shape[0], ratings.shape[0]))
similarities[arri, arrj] = np.abs(ratings[arri]-ratings[arrj])
I want to make a 2D array similarities such that similarities[i, j] is the difference between row i and row j of ratings. Instead, I get:
ValueError: shape mismatch: value array of shape (943,1682) could not be broadcast to indexing result of shape (943,)
(screenshot: https://i.stack.imgur.com/gtst9.png)
The problem is how NumPy iterates through the array when indexing a two-dimensional array with two arrays.
First some setup:
import numpy;
ratings = numpy.arange(1, 6)
indicesX = numpy.indices((ratings.shape[0],1))[0]
indicesY = numpy.indices((ratings.shape[0],1))[0]
ratings: [1 2 3 4 5]
indicesX: [[0][1][2][3][4]]
indicesY: [[0][1][2][3][4]]
Now let's see what your program produces:
similarities = numpy.zeros((ratings.shape[0], ratings.shape[0]))
similarities[indicesX, indicesY] = numpy.abs(ratings[indicesX]-ratings[0])
similarities:
[[0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 2. 0. 0.]
[0. 0. 0. 3. 0.]
[0. 0. 0. 0. 4.]]
As you can see, numpy iterates over similarities basically like the following:
for i in range(5):
    similarities[indicesX[i], indicesY[i]] = numpy.abs(ratings[i] - ratings[0])
similarities:
[[0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 2. 0. 0.]
[0. 0. 0. 3. 0.]
[0. 0. 0. 0. 4.]]
Now instead we need indices like the following to iterate through the entire array:
indicesX = [0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4]
indicesY = [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4]
We can do that as follows:
# Reshape indicesX from (x,1) to (x,). That's important for numpy.tile().
indicesX = indicesX.reshape(indicesX.shape[0])
indicesX = numpy.tile(indicesX, ratings.shape[0])
indicesY = numpy.repeat(indicesY, ratings.shape[0])
indicesX: [0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4]
indicesY: [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4]
Perfect! Now just call similarities[indicesX, indicesY] = numpy.abs(ratings[indicesX]-ratings[indicesY]) again and we see:
similarities:
[[0. 1. 2. 3. 4.]
[1. 0. 1. 2. 3.]
[2. 1. 0. 1. 2.]
[3. 2. 1. 0. 1.]
[4. 3. 2. 1. 0.]]
Here is the whole code again:
import numpy;
ratings = numpy.arange(1, 6)
indicesX = numpy.indices((ratings.shape[0],1))[0]
indicesY = numpy.indices((ratings.shape[0],1))[0]
similarities = numpy.zeros((ratings.shape[0], ratings.shape[0]))
indicesX = indicesX.reshape(indicesX.shape[0])
indicesX = numpy.tile(indicesX, ratings.shape[0])
indicesY = numpy.repeat(indicesY, ratings.shape[0])
similarities[indicesX, indicesY] = numpy.abs(ratings[indicesX]-ratings[indicesY])
print(similarities)
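As a side note (a sketch of an alternative, not needed for the indexing explanation above): broadcasting produces the same all-pairs absolute-difference matrix without building any index arrays:

import numpy
ratings = numpy.arange(1, 6)
# subtract a column vector from a row vector -> (5, 5) matrix of pairwise differences
similarities = numpy.abs(ratings[:, numpy.newaxis] - ratings[numpy.newaxis, :])
print(similarities)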
PS
You commented on your own post to improve it. When you want to improve your question, you should edit it instead of commenting on it.

Linear least-squares solution for 3d inputs

Problem
Say I have two arrays with the following shapes:
y.shape is (z, b). Picture this as a collection of z (b,) y vectors.
x.shape is (z, b, c). Picture this as a collection of z (b, c) multivariate x matrices.
My intent is to find the z independent vectors of least-squares coefficient solutions. I.e. the first solution is from regressing y[0] on x[0], where those inputs have shape (b, ) and (b, c) respectively. (b observations, c features.) The result would be shape (z, c).
Some example data
np.random.seed(123)
x = np.random.randn(190, 20, 3)
y = np.random.randn(190, 20) # Assumes no intercept term
# First vector of coefficients
np.linalg.lstsq(x[0], y[0])[0]
# array([-0.12823781, -0.3055392 , 0.11602805])
# Last vector of coefficients
np.linalg.lstsq(x[-1], y[-1])[0]
# array([-0.02777503, -0.20425779, 0.22874169])
NumPy's least-squares solver lstsq can't operate on these. (With my intended result being shape (190, 3), or 190 vectors of 3 coefficients each. Each (3,) vector is one coefficient set from regressions with n=20.)
Is there a workaround to get the coefficient matrices wrapped into one result array? I'm thinking possibly of the matrix (normal-equations) formulation of least squares, b = (X^T X)^(-1) X^T y.
For a 1d y and 2d x this would just be:
def coefs(y, x):
    return np.dot(np.linalg.inv(np.dot(x.T, x)), np.dot(x.T, y))
but I'm having trouble getting this to accept a 2d y and 3d x as above.
Lastly, I'm curious as to why lstsq has trouble here. Is there a simple answer as to why the inputs must be at most 2d?
Here is a small demo to illustrate:
- the problems mentioned in my comments
- a mostly empirical analysis of looped lstsq vs. one-step embedded lstsq
(with a somewhat surprising result at the end, to be taken with a grain of salt):
Code
import numpy as np
import scipy.sparse as sp
from sklearn.datasets import make_regression
from time import perf_counter as pc

np.set_printoptions(edgeitems=3, infstr='inf',
                    linewidth=160, nanstr='nan', precision=1,
                    suppress=False, threshold=1000, formatter=None)

""" Create task """
Z, B, C = 4, 3, 2
Zs = []
Bs = []
for i in range(Z):
    X, y = make_regression(n_samples=B, n_features=C, random_state=i)
    Zs.append(X)
    Bs.append(y)
Zs = np.array(Zs)
Bs = np.array(Bs)

""" Independent looping """
print('LOOPED CALLS')
start = pc()
result = np.empty((Z, C))
for z in range(Z):
    result[z] = np.linalg.lstsq(Zs[z], Bs[z])[0]
end = pc()
print('lhs-shape: ', Zs.shape)
print('lhs-dense-fill-ratio: ', np.count_nonzero(Zs) / np.product(Zs.shape))
print('used time: ', end-start)
print(result)

""" Embedding in one """
print('EMBEDDING INTO ONE CALL')
Zs_ = sp.block_diag([Zs[i] for i in range(Z)]).todense()  # convenient to use scipy.sparse
# note: there is a dense one too -> scipy.linalg.block_diag
Bs_ = Bs.flatten()
start = pc()  # one could argue whether the transform above should be timed too!
result_ = np.linalg.lstsq(Zs_, Bs_)[0]
end = pc()
print('lhs-shape: ', Zs_.shape)
print('lhs-dense-fill-ratio: ', np.count_nonzero(Zs_) / np.product(Zs_.shape))
print('used time: ', end-start)
print(result_)
Output
LOOPED CALLS
lhs-shape: (4, 3, 2)
lhs-dense-fill-ratio: 1.0
used time: 0.0005415275241778155
[[ 89.2 43.8]
[ 68.5 41.9]
[ 61.9 20.5]
[ 5.1 44.1]]
EMBEDDING INTO ONE CALL
lhs-shape: (12, 8)
lhs-dense-fill-ratio: 0.25
used time: 0.00015907748341232328
[ 89.2 43.8 68.5 41.9 61.9 20.5 5.1 44.1]
lstsq problem-dimensions for each case
While the original data looks like:
[[[ 2.2 1. ]
[-1. 1.9]
[ 0.4 1.8]]
[[-1.1 -0.5]
[-2.3 0.9]
[-0.6 1.6]]
[[ 1.6 -2.1]
[-0.1 -0.4]
[-0.8 -1.8]]
[[-0.3 -0.4]
[ 0.1 -1.9]
[ 1.8 0.4]]]
[[ 242.7 -5.4 112.9]
[ -95.7 -121.4 26.2]
[ 57.9 -12. -88.8]
[ -17.1 -81.6 28.4]]
and each solve looks like:
LHS
[[ 2.2 1. ]
[-1. 1.9]
[ 0.4 1.8]]
RHS
[ 242.7 -5.4 112.9]
the embedded problem (one solving-step) looks like:
LHS
[[ 2.2 1. 0. 0. 0. 0. 0. 0. ]
[-1. 1.9 0. 0. 0. 0. 0. 0. ]
[ 0.4 1.8 0. 0. 0. 0. 0. 0. ]
[ 0. 0. -1.1 -0.5 0. 0. 0. 0. ]
[ 0. 0. -2.3 0.9 0. 0. 0. 0. ]
[ 0. 0. -0.6 1.6 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 1.6 -2.1 0. 0. ]
[ 0. 0. 0. 0. -0.1 -0.4 0. 0. ]
[ 0. 0. 0. 0. -0.8 -1.8 0. 0. ]
[ 0. 0. 0. 0. 0. 0. -0.3 -0.4]
[ 0. 0. 0. 0. 0. 0. 0.1 -1.9]
[ 0. 0. 0. 0. 0. 0. 1.8 0.4]]
RHS
[ 242.7 -5.4 112.9 -95.7 -121.4 26.2 57.9 -12. -88.8 -17.1 -81.6 28.4]
Given the assumptions and standard form of lstsq, there is no way to embed this independence assumption without introducing a lot of zeros!
lstsq is:
- not able to exploit sparsity, as the core algorithm is dense (take a look at the transformed shape: this will be heavy in terms of memory and computation!)
- not able to use information from fit 0 to speed up something in fit 1 (they are independent after all; no information gain in theory)
- able to vectorize a lot (but that does not help in general)
Your example shapes
Trimmed output for your specific shapes; this time a sparse solver is tested too.
Added code (at the end)
print('EMBEDDING -> sparse-solver')
Zs_ = sp.csc_matrix(Zs_) # sparse!
start = pc()
result__ = sp.linalg.lsmr(Zs_, Bs_)[0]
end = pc()
print('lhs-shape: ', Zs_.shape)
print('lhs-dense-fill-ratio: ', Zs_.nnz / np.product(Zs_.shape))
print('used time: ', end-start)
print(result__)
Output
LOOPED CALLS
lhs-shape: (190, 20, 3)
lhs-dense-fill-ratio: 1.0
used time: 0.01716980329027777
[ 11.9 31.8 29.6]
...
[ 44.8 28.2 62.3]]
EMBEDDING INTO ONE CALL
lhs-shape: (3800, 570)
lhs-dense-fill-ratio: 0.00526315789474
used time: 0.6774500271820254
[ 11.9 31.8 29.6 ... 44.8 28.2 62.3]
EMBEDDING -> sparse-solver
lhs-shape: (3800, 570)
lhs-dense-fill-ratio: 0.00526315789474
used time: 0.0038423098412817547 # a bit of a surprise
[ 11.9 31.8 29.6 ... 44.8 28.2 62.3]
Conclusion
In general: solve independently!
In some cases the task above will be solved faster with the sparse-solver approach, but the analysis here is hard: we are comparing two completely different algorithms (direct vs. iterative), and the results might change dramatically for other data.
Here is the linear algebra solution, with the speed right on par with @sascha's looped version for smaller arrays.
print('Matrix formulation')
start = pc()
result = np.squeeze(np.matmul(np.linalg.inv(np.matmul(Zs.swapaxes(1, 2), Zs)),
                              np.matmul(Zs.swapaxes(1, 2), np.atleast_3d(Bs))))
end = pc()
print('used time: ', end-start)
print(result)
Output:
Matrix formulation
used time: 0.00015713176480858237
[[ 89.2 43.8]
[ 68.5 41.9]
[ 61.9 20.5]
[ 5.1 44.1]]
However, @sascha's answer wins out easily for much larger inputs, especially as the size of the third dimension grows (the number of exogenous variables/features).
Z, B, C = 400, 300, 20
Zs = []
Bs = []
for i in range(Z):
    X, y = make_regression(n_samples=B, n_features=C, random_state=i)
    Zs.append(X)
    Bs.append(y)
Zs = np.array(Zs)
Bs = np.array(Bs)

# --------
print('Matrix formulation')
start = pc()
result = np.squeeze(np.matmul(np.linalg.inv(np.matmul(Zs.swapaxes(1, 2), Zs)),
                              np.matmul(Zs.swapaxes(1, 2), np.atleast_3d(Bs))))
end = pc()
print('used time: ', end-start)
print(result)

# --------
print('Looped calls')
start = pc()
result = np.empty((Z, C))
for z in range(Z):
    result[z] = np.linalg.lstsq(Zs[z], Bs[z])[0]
end = pc()
print('used time: ', end-start)
print(result)
Output:
Matrix formulation
used time: 0.24000779996413257
[[ 1.2e+01 1.3e-15 6.3e+01 ..., -8.9e-15 5.3e-15 -1.1e-14]
[ 5.8e+01 2.7e-14 -4.8e-15 ..., 8.5e+01 -1.5e-14 1.8e-14]
[ 1.2e+01 -1.2e-14 4.4e-16 ..., 6.0e-15 8.6e+01 6.0e+01]
...,
[ 2.9e-15 6.6e+01 1.1e-15 ..., 9.8e+01 -2.9e-14 8.4e+01]
[ 2.8e+01 6.1e+01 -1.2e-14 ..., -2.5e-14 6.3e+01 5.9e+01]
[ 7.0e+01 3.3e-16 8.4e+00 ..., 4.1e+01 -6.2e-15 5.8e+01]]
Looped calls
used time: 0.17400113389658145
[[ 1.2e+01 7.1e-15 6.3e+01 ..., -2.8e-14 1.1e-14 -4.8e-14]
[ 5.8e+01 -5.7e-14 -4.9e-14 ..., 8.5e+01 -5.3e-15 6.8e-14]
[ 1.2e+01 3.6e-14 4.5e-14 ..., -3.6e-15 8.6e+01 6.0e+01]
...,
[ 6.3e-14 6.6e+01 -1.4e-13 ..., 9.8e+01 2.8e-14 8.4e+01]
[ 2.8e+01 6.1e+01 -2.1e-14 ..., -1.4e-14 6.3e+01 5.9e+01]
[ 7.0e+01 -1.1e-13 8.4e+00 ..., 4.1e+01 -9.4e-14 5.8e+01]]
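A related variant (a sketch, not benchmarked here): instead of explicitly inverting X^T X, the stacked normal equations can be handed to np.linalg.solve, which broadcasts over the leading z dimension and is generally preferable to forming the inverse:

import numpy as np

# Zs: (z, b, c) design matrices, Bs: (z, b) targets, as above
Xt = Zs.swapaxes(1, 2)                      # (z, c, b)
gram = np.matmul(Xt, Zs)                    # (z, c, c): X^T X for every slice
rhs = np.matmul(Xt, Bs[..., np.newaxis])    # (z, c, 1): X^T y for every slice
coefs = np.linalg.solve(gram, rhs)[..., 0]  # (z, c) stacked coefficient vectors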

How to add a dot in python numpy ndarray - data type issue

I have a NumPy ndarray that looks like:
[[ 0 0 0 1 0]
[ 0 0 0 0 1]]
but I would like to process it to the following form:
[[ 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1.]]
How would I achieve this?
It looks to me like you have an array of some integer type. You probably want to convert to an array of float:
array_float = array_int.astype(float)
e.g.:
>>> ones_i = np.ones(10, dtype=int)
>>> print ones_i
[1 1 1 1 1 1 1 1 1 1]
>>> ones_f = ones_i.astype(float)
>>> print ones_f
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
With that said, I think that it is worth asking why you want to process the string representation of your array. There very well might be a better way to accomplish your goal.
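Alternatively (a small sketch of the same idea), you can create the array as float from the start so no conversion is needed:

import numpy as np

a = np.array([[0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1]], dtype=float)
# a now prints with the dots: [[0. 0. 0. 1. 0.] [0. 0. 0. 0. 1.]]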

What is the correct way to mix feature sparse matrices with sklearn?

The other day I was dealing with a machine learning task that required extracting several types of feature matrices. I saved these feature matrices as NumPy arrays on disk in order to use them later in some estimator (this was a classification task). When I wanted to use all the features, I just concatenated the matrices into one big feature matrix and presented that to the estimator.
I do not know if this is the correct way to work with a feature matrix that has a lot of patterns (counts) in it. What other approaches should I use to mix several types of features correctly? Looking through the documentation, I found FeatureUnion, which seems to do this task.
For example, let's say I would like to create a big feature matrix from 3 vectorizer approaches: TfidfVectorizer, CountVectorizer and HashingVectorizer. This is what I tried, following the documentation example:
#Read the .csv file
import pandas as pd
df = pd.read_csv('file.csv',
                 header=0, sep=',', names=['id', 'text', 'labels'])

#vectorizer 1
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
                             sublinear_tf=False, ngram_range=(2,2))

#vectorizer 2
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(ngram_range=(2,2))

#vectorizer 3
from sklearn.feature_extraction.text import HashingVectorizer
hash_vect = HashingVectorizer(ngram_range=(2,2))

#Combine the above vectorizers in one single feature matrix:
from sklearn.pipeline import FeatureUnion
combined_features = FeatureUnion([("tfidf_vect", tfidf_vect),
                                  ("bow", bow),
                                  ("hash", hash_vect)])

X_combined_features = combined_features.fit_transform(df['text'].values)
y = df['labels'].values

#Check the matrix
print X_combined_features.toarray()
Then:
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
Split the data:
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X_combined_features,y, test_size=0.33)
So I have a few questions:
Is this the right approach to mixing several feature extractors in order to yield one big feature matrix? And assuming I create my own "vectorizers" that return sparse matrices, how can I correctly use the FeatureUnion interface to mix them with the above 3 features?
update
Let's say that I have a matrix like this:
Matrix A ((152, 33))
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
Then with my vectorizer that returns a numpy array I get this feature matrix:
Matrix B ((152, 10))
[[4210 228 25 ..., 0 0 0]
[4490 180 96 ..., 10 4 6]
[4795 139 8 ..., 0 0 1]
...,
[1475 58 3 ..., 0 0 0]
[4668 256 25 ..., 0 0 0]
[1955 111 10 ..., 0 0 0]]
Matrix C ((152, 46))
[[ 0 0 0 ..., 0 0 0]
[ 0 0 0 ..., 0 0 17]
[ 0 0 0 ..., 0 0 0]
...,
[ 0 0 0 ..., 0 0 0]
[ 0 0 0 ..., 0 0 0]
[ 0 0 0 ..., 0 0 0]]
How can I merge A, B and C correctly with numpy.hstack, scipy.sparse.hstack or FeatureUnion? Do you think this is a correct pipeline approach to follow for any machine learning task?
Is this the right approach to mix several feature extractors in order to yield a big feature matrix?
In terms of correctness of the result, your approach is right, since FeatureUnion runs each individual transformer on the input data and concatenates the resulting matrices horizontally. However, it's not the only way, and which way is better in terms of efficiency will depend on your use case (more on this later).
Assume I create my own "vectorizers" and they return sparse matrices, how can I use correctly the FeatureUnion interface to mix them with the above 3 features?
Using FeatureUnion, you simply need to append your new transformer to the transformer list:
custom_vect = YourCustomVectorizer()
combined_features = FeatureUnion([("tfidf_vect", tfidf_vect),
                                  ("bow", bow),
                                  ("hash", hash_vect),
                                  ("custom", custom_vect)])
However, if your input data and most of the transformers are fixed (e.g. when you're experimenting with the inclusion of a new transformer), the above approach leads to a lot of re-computation. In that case, an alternative is to pre-compute and store the intermediate results of the transformers (dense or sparse matrices) and concatenate them manually using numpy.hstack or scipy.sparse.hstack when needed, as sketched below.
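A minimal sketch of that manual route, assuming A, B and C are the matrices from your update (shapes (152, 33), (152, 10) and (152, 46)):

import numpy as np
import scipy.sparse as sp

# if A, B and C are all dense NumPy arrays:
X_dense = np.hstack([A, B, C])          # shape (152, 33 + 10 + 46)

# if you want to keep (or make) the result sparse:
X_sparse = sp.hstack([sp.csr_matrix(A),
                      sp.csr_matrix(B),
                      sp.csr_matrix(C)]).tocsr()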
If your input data is always changing but the list of transformers is fixed, FeatureUnion offers more convenience. Another advantage is its n_jobs option, which helps you parallelize the fitting process.
Side note: it seems a little bit strange to mix the hashing vectorizer with the other vectorizers, since the hashing vectorizer is typically used when you cannot afford the exact versions.

Fastest way to compute upper-triangular matrix of geometric series (Python)

Thanks in advance for the help.
Using Python (mostly numpy), I am trying to compute an upper-triangular matrix where each row "j" is the first j-terms of a geometric series, all rows using the same parameter.
For example, if my parameter is B (where abs(B) <= 1, i.e. B in [-1,1]), then row 1 would be [1 B B^2 B^3 ... B^(N-1)], row 2 would be [0 1 B B^2 ... B^(N-2)], ... and row N would be [0 0 0 ... 1].
This computation is key to a Bayesian Metropolis-Gibbs sampler, and so needs to be done thousands of times for new values of "B".
I have currently tried this two ways:
Method 1 - Mostly Vectorized:
B_Matrix = np.triu(np.dot(np.reshape(B**(-1*np.array(range(N))),(N,1)),np.reshape(B**(np.array(range(N))),(1,N))))
Essentially, this is the upper triangle part of a product of an Nx1 and 1xN set of matrices:
upper triangle ([1 B^(-1) B^(-2) ... B^(-(N-1))]' * [1 B B^2 B^3 ... B^(N-1)])
This works great for small N (algebraically it is correct), but for large N it errors out. It also errors out for B=0 (which should be allowed). I believe this stems from B^(-N) ~ inf for small B and large N.
Method 2:
B_Matrix = np.zeros((N,N))
B_Row_1 = B**(np.array(range(N)))
for n in range(N):
    B_Matrix[n, n:] = B_Row_1[0:N-n]
So that just fills in the matrix row by row, but uses a loop which slows things down.
I was wondering if anyone had run into this before, or had any better ideas on how to compute this matrix in a faster way.
I've never posted on Stack Overflow before, but I didn't see this question anywhere and thought I'd ask.
Let me know if there's a better place to ask this, or if I should provide any more detail.
You could use scipy.linalg.toeplitz:
In [11]: from scipy.linalg import toeplitz
In [12]: n = 5
In [13]: b = 0.5
In [14]: toeplitz(b**np.arange(n), np.zeros(n)).T
Out[14]:
array([[ 1. , 0.5 , 0.25 , 0.125 , 0.0625],
[ 0. , 1. , 0.5 , 0.25 , 0.125 ],
[ 0. , 0. , 1. , 0.5 , 0.25 ],
[ 0. , 0. , 0. , 1. , 0.5 ],
[ 0. , 0. , 0. , 0. , 1. ]])
If your use of the array is strictly "read only", you can play tricks with numpy strides to quickly create an array that uses only 2*n-1 elements (instead of n^2):
In [55]: from numpy.lib.stride_tricks import as_strided
In [56]: def make_array(b, n):
....: vals = np.zeros(2*n - 1)
....: vals[n-1:] = b**np.arange(n)
....: a = as_strided(vals[n-1:], shape=(n, n), strides=(-vals.strides[0], vals.strides[0]))
....: return a
....:
In [57]: make_array(0.5, 4)
Out[57]:
array([[ 1. , 0.5 , 0.25 , 0.125],
[ 0. , 1. , 0.5 , 0.25 ],
[ 0. , 0. , 1. , 0.5 ],
[ 0. , 0. , 0. , 1. ]])
If you will modify the array in-place, make a copy of the result returned by make_array(b, n). That is, arr = make_array(b, n).copy().
The function make_array2 incorporates the suggestion @Jaime made in the comments:
In [30]: def make_array2(b, n):
....: vals = np.zeros(2*n-1)
....: vals[n-1] = 1
....: vals[n:] = b
....: np.cumproduct(vals[n:], out=vals[n:])
....: a = as_strided(vals[n-1:], shape=(n, n), strides=(-vals.strides[0], vals.strides[0]))
....: return a
....:
In [31]: make_array2(0.5, 4)
Out[31]:
array([[ 1. , 0.5 , 0.25 , 0.125],
[ 0. , 1. , 0.5 , 0.25 ],
[ 0. , 0. , 1. , 0.5 ],
[ 0. , 0. , 0. , 1. ]])
make_array2 is more than twice as fast as make_array:
In [35]: %timeit make_array(0.99, 600)
10000 loops, best of 3: 23.4 µs per loop
In [36]: %timeit make_array2(0.99, 600)
100000 loops, best of 3: 10.7 µs per loop
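Another simple vectorized variant (a sketch, not benchmarked against the versions above): build the exponent matrix j - i by broadcasting, clip the lower triangle's exponents at zero so no negative powers are ever formed (which is what breaks Method 1 for B = 0 or large N), and mask with np.triu:

import numpy as np

def geom_upper(b, n):
    # exponents[i, j] = max(j - i, 0); b**0 == 1 even for b == 0
    exponents = np.maximum(np.arange(n)[None, :] - np.arange(n)[:, None], 0)
    return np.triu(float(b) ** exponents)

print(geom_upper(0.5, 5))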
