ALS algorithm in Dask optimization - python

I am trying to implement the ALS algorithm in Dask, but I am having trouble figuring out how to compute the latent features in one step. I followed the formulas in this Stack Overflow thread and came up with this code:
Items = da.linalg.lstsq(da.add(da.dot(Users, Users.T), lambda_ * da.eye(n_factors)),
                        da.dot(Users, X))[0].T.compute()
Items = np.where(Items < 0, 0, Items)

Users = da.linalg.lstsq(da.add(da.dot(Items.T, Items), lambda_ * da.eye(n_factors)),
                        da.dot(Items.T, X.T))[0].compute()
Users = np.where(Users < 0, 0, Users)
But I don't think this works correctly, because MSE is not decreasing.
Example input:
n_factors = 2
lambda_ = 0.1
# We have 6 users and 4 items
The matrices X_train (6x4), R (4x6), Users (2x6) and Items (4x2) look like:

X_train (6x4):
1 0 0 0
0 0 0 0
0 3 0 0
0 3 0 0
1 1 1 0
1 0 0 0

R (4x6):
5 2 1 0 0 0
4 0 0 0 1 1
4 0 0 0 0 0
0 0 0 0 0 0

Users (2x6):
0.8 1.3 1.1 0.2 4.1 1.6
3.9 4.3 3.5 2.7 4.3 0.5

Items (4x2):
2.9 1.5
0.2 4.7
0.9 1.1
4.8 3.0
EDIT: I found the problem, but I don't know how to get around it. Before the iteration starts, I set all values in the X_train matrix where there is no rating to 0.
X_train = da.nan_to_num(X_train)
The reason is that the dot product works only on numeric values. But because the matrix is very sparse, 90% of it now consists of zeros, and instead of fitting the real ratings, the model fits these zeros.
Any help would be highly appreciated. <3

One way to handle gaps or missing values in data sets is to use masked arrays. As of May 2017, Dask also supports them.
Defining a masked array in Dask is fairly simple and similar to NumPy's. All supported functions are listed in the docs; here are some of the most commonly used approaches:
data_set = da.array([[1, 2], [3, 4]])
masked_data_set_1 = da.ma.masked_array(data_set, mask=[[False, True],[True, False]])
# returns [[1, --],[--, 4]]
masked_data_set_2 = da.ma.masked_equal(data_set, 4)
# returns [[1, 2],[3, --]]
masked_data_set_3 = da.ma.masked_where(data_set < 3, data_set)
# returns [[--, --],[3, 4]]
In your case, you are trying to perform the dot product da.dot(Users, X). Instead of setting all NaN values to 0, you can use a masked array:
masked_X = da.ma.masked_where(X != X, X)
Now you can perform the dot product as:
da.ma.getdata(da.dot(Users,masked_X))
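Putting the pieces together, here is a rough sketch of how the Items half-step from the question could use the masked matrix. It assumes n_factors, lambda_, Users and X are defined as in the question, and it only illustrates the masking idea, not a verified fix for the non-decreasing MSE:
import dask.array as da

# Mask the missing ratings (NaN != NaN) instead of replacing them with zeros.
masked_X = da.ma.masked_where(X != X, X)

# One half-step of ALS for the item factors, mirroring the question's code.
lhs = da.dot(Users, Users.T) + lambda_ * da.eye(n_factors)
rhs = da.ma.getdata(da.dot(Users, masked_X))   # dot product with the masked matrix, per the answer above
Items = da.linalg.lstsq(lhs, rhs)[0].T.compute()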

Related

Stratified Sampling in Python without scikit-learn

I have a vector which contains 10 values of sample 1 and 25 values of sample 2.
Fact = np.array((2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2))
I want to create a stratified output vector where:
sample 1 is divided into 80% (8 values of 1) and 20% (2 values of 0);
sample 2 is divided into 80% (20 values of 1) and 20% (5 values of 0).
The expected output will be :
Output = np.array((0,1,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,0,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1))
How can I automate this? I can't use the sampling functions from scikit-learn because this is not for a machine learning experiment.
Here is one way to get your desired result, with reproducibility of output added. We draw random index values for each of the two groups from the input (fact) array, without replacement. Then, we create a new output array where we assign 1's in locations corresponding to the drawn index values and assign 0's everywhere else.
import numpy as np
from numpy.random import RandomState

rng = RandomState(123)

fact = np.array(
    (2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2),
    dtype='int8'
)

idx_arr = np.hstack(
    (
        rng.choice(np.argwhere(fact == 1).flatten(), 8, replace=False),
        rng.choice(np.argwhere(fact == 2).flatten(), 20, replace=False),
    )
)

out = np.zeros_like(fact, dtype='int8')
np.put(out, idx_arr, 1)
print(out)
# [0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1]
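If you want a quick sanity check that each group ended up with the requested 80/20 split, something like this works:
# Count the 1s within each group: sample 1 should get 8 of 10, sample 2 should get 20 of 25.
for grp in (1, 2):
    picked = out[fact == grp]
    print(grp, int(picked.sum()), "of", picked.size)
# 1 8 of 10
# 2 20 of 25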

Reclassification by column name in pandas

I would like to apply a test to a pandas dataframe, and create flags in a corresponding dataframe based on the test results. I've gotten this far:
import numpy as np
import pandas as pd
matrix = pd.DataFrame({'a': [1, 11, 2, 3, 4], 'b': [5, 6, 22, 8, 9]})
flags = pd.DataFrame(np.zeros(matrix.shape), columns=matrix.columns)
flag_values = pd.Series({"a": 100, "b": 200})
flags[matrix > 10] = flag_values
but this raises the error
ValueError: Must specify axis=0 or 1
Where can I specify the axis in this situation? Is there a better way to accomplish this?
Edit:
The result I'm looking for in this example for "flags" is
a b
0 0
100 0
0 200
0 0
0 0
You could define flags = (matrix > 10) * flag_values:
In [35]: (matrix > 10) * flag_values
Out[35]:
a b
0 0 0
1 100 0
2 0 200
3 0 0
4 0 0
This relies on True having numeric value 1 and False having numeric value 0.
It also relies on Pandas' nifty automatic alignment of DataFrames (and Series) based on labels before performing arithmetic operations.
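As a small illustration of that alignment (a toy sketch; the DataFrame below is made up for demonstration), the Series values are matched to the DataFrame columns by label, not by position:
import pandas as pd

df = pd.DataFrame({'a': [True, False], 'b': [False, True]})
s = pd.Series({'b': 200, 'a': 100})   # given in a different order on purpose

print(df * s)
#      a    b
# 0  100    0
# 1    0  200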
Another option: mask combined with mul:
flags.mask(matrix > 10, 1).mul(flag_values, axis=1)
Out[566]:
a b
0 0.0 0.0
1 100.0 0.0
2 0.0 200.0
3 0.0 0.0
4 0.0 0.0

Tf-idf calculation using gensim

I have a tf-idf example from an ISI paper. I'm trying to validate my code against this example, but I get a different result from my code and I don't know what the reason is!
Term-document matrix from paper:
acceptance [ 0 1 0 1 1 0
information 0 1 0 1 0 0
media 1 0 1 0 0 2
model 0 0 1 1 0 0
selection 1 0 1 0 0 0
technology 0 1 0 1 1 0]
Tf-idf matrix from paper:
acceptance [ 0 0.4 0 0.3 0.7 0
information 0 0.7 0 0.5 0 0
media 0.3 0 0.2 0 0 1
model 0 0 0.6 0.5 0 0
selection 0.9 0 0.6 0 0 0
technology 0 0.4 0 0.3 0.7 0]
My tf-idf matrix:
acceptance [ 0 0.4 0 0.3 0.7 0
information 0 0.7 0 0.5 0 0
media 0.5 0 0.4 0 0 1
model 0 0 0.6 0.5 0 0
selection 0.8 0 0.6 0 0 0
technology 0 0.4 0 0.3 0.7 0]
My code:
from gensim import models

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
I've also tried this code:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts).toarray()  # counts is the term-document matrix
But I didn't get the expected answer with either approach.
The reason for the difference between the results, as you mentioned, is that papers use many different methods to calculate TF-IDF. If you read the Wikipedia TF-IDF page, it mentions that TF-IDF is calculated as
tfidf(t, d, D) = tf(t, d) * idf(t, D)
and both tf(t, d) and idf(t, D) can be computed with different functions, which changes the final TF-IDF value. The functions differ because they suit different applications.
Gensim's TF-IDF model can use any function for tf(t, d) and idf(t, D), as mentioned in its documentation:
Compute tf-idf by multiplying a local component (term frequency) with
a global component (inverse document frequency), and normalizing the
resulting documents to unit length. Formula for unnormalized weight of
term i in document j in a corpus of D documents:
weight_{i,j} = frequency_{i,j} * log_2(D / document_freq_{i})
or, more generally:
weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D)
so you can plug in your own custom wlocal and wglobal functions.
Default for wlocal is identity (other options: math.sqrt, math.log1p,
...) and default for wglobal is log_2(total_docs / doc_freq), giving
the formula above.
Now, if you want to reproduce the paper's result exactly, you must know which functions it used to calculate its TF-IDF matrix.
There is also a good example in the Gensim Google group that shows how you can use a custom function for calculating TF-IDF.
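For instance, a minimal sketch of plugging custom weighting functions into TfidfModel might look like this (the toy texts and the particular tf/idf choices here are only placeholders; to reproduce the paper you would substitute whatever functions and normalization it actually uses):
import math
from gensim import corpora, models

# Toy corpus, just to show the mechanics.
texts = [["media", "selection"], ["acceptance", "information", "technology"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# wlocal receives the raw term frequency; wglobal receives (document_freq, total_docs).
tfidf = models.TfidfModel(
    corpus,
    wlocal=lambda tf: tf,                                      # identity local weight
    wglobal=lambda df, total_docs: math.log(total_docs / df),  # natural-log idf
    normalize=False,                                           # or True for unit-length documents
)
print(tfidf[corpus[0]])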

Pick random coordinates in Numpy array based on condition

I have used convolution2d to generate some statistics on conditions of local patterns. To be complete: I'm working with images, the value 0.5 is my 'gray screen', and I cannot use masks before this step, unfortunately (a dependence on some other packages). I want to add new objects to my image, but each object should overlap at least 75% non-gray-screen. Let's assume the new object is square: I mask the image into gray-screen versus the rest, then do a 2-D convolution with an n-by-n matrix filled with 1s, so I get the number of gray-screen pixels in each patch. This all works, so I have a matrix of suitable places to put my new object. How do I efficiently pick a random one from this matrix?
Here is a small example with a 5x5 image and a 2x2 convolution matrix, where I want a random coordinate in the last matrix with a 1 (because there is at most one 0.5 in that patch):
Image:
1 0.5 0.5 0 1
0.5 0.5 0 1 1
0.5 0.5 1 1 0.5
0.5 1 0 0 1
1 1 0 0 1
Convolution matrix:
1 1
1 1
Convoluted image:
3 3 1 0
4 2 0 1
3 1 0 1
1 0 0 0
Conditioned on <= 1:
0 0 1 1
0 0 1 1
0 1 1 1
1 1 1 1
How do I get a uniformly distributed coordinate of the 1s efficiently?
np.where and np.random.randint should do the trick:
# grab the indices that satisfy the condition
x, y = np.where(convoluted_image <= 1)
# choose one of those indices at random
i = np.random.randint(len(x))
random_pos = [x[i], y[i]]
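An equivalent variant uses np.argwhere, which returns the (row, col) pairs directly (convoluted_image is again assumed to be the conditioned array from the question):
# candidate coordinates as an (N, 2) array of (row, col) pairs
candidates = np.argwhere(convoluted_image <= 1)
# pick one pair uniformly at random
random_pos = candidates[np.random.randint(len(candidates))]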

Numpy vectorized summation with variable number of factors

I am currently computing a function that contains a summation over an index. The index is between 0 and the integer part of T; ideally I would like to be able to compute this summation quickly for several values of T.
In a real-life case, most of the values of T are small, but a small percentage can be one or two orders of magnitude larger than the average.
What I am doing now is:
1) I define the vector T, e.g. (my real-life data have a much larger number of entries, it is just to give an idea):
import numpy as np
T = np.random.exponential(5, 10)
2) I create a matrix containing the factors between 0 and int(T), and then zeroes:
n = int(T.max())
j = ((np.arange(n) < T[:,np.newaxis])*np.arange(1,n+1)).astype(int).transpose()
print(j)
[[ 1 1 1 1 1 1 1 1 1 1]
[ 2 0 2 2 2 0 2 0 2 2]
[ 0 0 3 0 3 0 3 0 3 3]
[ 0 0 4 0 4 0 0 0 4 4]
[ 0 0 5 0 5 0 0 0 5 5]
[ 0 0 6 0 6 0 0 0 6 6]
[ 0 0 7 0 7 0 0 0 0 7]
[ 0 0 8 0 8 0 0 0 0 8]
[ 0 0 9 0 9 0 0 0 0 9]
[ 0 0 0 0 10 0 0 0 0 10]
[ 0 0 0 0 11 0 0 0 0 0]
[ 0 0 0 0 12 0 0 0 0 0]]
3) I generate the single elements of the summation, using a mask to avoid applying the function to the elements that are zero:
A = np.log(1 + (1 + j) * 5)* (j>0)
4) I sum along the columns:
A.sum(axis=0)
Obtaining:
array([ 5.170484 , 2.39789527, 29.96464821, 5.170484 ,
42.29052851, 2.39789527, 8.21500643, 2.39789527,
18.49060911, 33.9899999 ])
Is there a faster/better way to vectorize this? I have the feeling that it is very slow due to the large number of zeros that do not contribute to the sum, but since I am a beginner with NumPy I couldn't figure out a better way of writing it.
EDIT: in my actual problem, the function applied to j also depends on a second parameter tau (a vector of the same size as T), so the items contained in each column are not the same.
Looking at your j, each column contains the numbers 1 to N, where N is determined by the corresponding element of T. You then sum along each column, which is the same as summing up to N, because the remaining elements are zeros anyway. Those partial sums can be obtained with np.cumsum, and the N values, which are basically the limits of each column in j, can be computed directly from T. The N values are then used as indices into the cumsum-ed values to give the final output.
This should be pretty fast and memory efficient, given that cumsum is the only real computation and it is done on a 1D array, compared with the original approach's summation over a 2D array along each column. Thus, we have a vectorized approach like so -
n = int(T.max())
vals = (np.log(1 + (1 + np.arange(1,n+1)) * 5)).cumsum()
out = vals[(T.astype(int)).clip(max=n-1)]
In terms of memory usage, we are generating three variables -
n : Scalar
vals : 1D array of n elements
out : 1D array of T.size elements (this is the output anyway)
Runtime test and verify output -
In [5]: def original_app(T):
...: n = int(T.max())
...: j = ((np.arange(n) < T[:,None])*np.arange(1,n+1)).astype(int).transpose()
...: A = np.log(1 + (1 + j) * 5)* (j>0)
...: return A.sum(axis=0)
...:
...: def vectorized_app(T):
...: n = int(T.max())
...: vals = (np.log(1 + (1 + np.arange(1,n+1)) * 5)).cumsum()
...: return vals[(T.astype(int)).clip(max=n-1)]
...:
In [6]: # Input array
...: T = np.random.exponential(5, 10000)
In [7]: %timeit original_app(T)
100 loops, best of 3: 9.62 ms per loop
In [8]: %timeit vectorized_app(T)
10000 loops, best of 3: 50.1 µs per loop
In [9]: np.allclose(original_app(T),vectorized_app(T)) # Verify outputs
Out[9]: True
