Tf-idf calculation using gensim - python

I have a tf-idf example from an ISI paper and I'm trying to validate my code against it, but I get a different result from my code and I don't know what the reason is.
Term-document matrix from paper:
acceptance [ 0 1 0 1 1 0
information 0 1 0 1 0 0
media 1 0 1 0 0 2
model 0 0 1 1 0 0
selection 1 0 1 0 0 0
technology 0 1 0 1 1 0]
Tf-idf matrix from paper:
acceptance [ 0 0.4 0 0.3 0.7 0
information 0 0.7 0 0.5 0 0
media 0.3 0 0.2 0 0 1
model 0 0 0.6 0.5 0 0
selection 0.9 0 0.6 0 0 0
technology 0 0.4 0 0.3 0.7 0]
My tf-idf matrix:
acceptance [ 0 0.4 0 0.3 0.7 0
information 0 0.7 0 0.5 0 0
media 0.5 0 0.4 0 0 1
model 0 0 0.6 0.5 0 0
selection 0.8 0 0.6 0 0 0
technology 0 0.4 0 0.3 0.7 0]
My code:
from gensim import models

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
I've also tried other code like this:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts).toarray()  # counts is the term-document matrix
But I still didn't get the expected answer.

The reason for this difference in results, as you mentioned, is that there are many methods to calculate TF-IDF in papers. If you read the Wikipedia TF-IDF page, it mentions that TF-IDF is calculated as
tfidf(t, d, D) = tf(t, d) · idf(t, D)
and both tf(t, d) and idf(t, D) can be calculated with different functions, which changes the final TF-IDF value. The functions differ because they suit different applications.
Gensim's TF-IDF model can use any function for tf(t, d) and idf(t, D), as mentioned in its documentation:
Compute tf-idf by multiplying a local component (term frequency) with
a global component (inverse document frequency), and normalizing the
resulting documents to unit length. Formula for unnormalized weight of
term i in document j in a corpus of D documents:
weight_{i,j} = frequency_{i,j} * log_2(D / document_freq_{i})
or, more generally:
weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D)
so you can plug in your own custom wlocal and wglobal functions.
Default for wlocal is identity (other options: math.sqrt, math.log1p,
...) and default for wglobal is log_2(total_docs / doc_freq), giving
the formula above.
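As a concrete check with those defaults: in the six-document corpus above, "media" occurs in 3 of the 6 documents, so its global weight is log_2(6/3) = 1, while a term occurring in all 6 documents would get log_2(6/6) = 0 and drop out of the TF-IDF matrix entirely.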
Now, if you want to reproduce the paper's result exactly, you need to know which functions it used to calculate its TF-IDF matrix.
There is also a good example in the Gensim Google group that shows how to use custom functions for calculating TF-IDF.
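For illustration, here is a minimal sketch of plugging custom wlocal/wglobal functions into gensim. The toy documents reproduce the term-document matrix above, but the particular weighting functions are just examples, not necessarily the ones the paper used:
import math
from gensim import corpora, models

# Six toy documents matching the term-document matrix above
documents = [["media", "selection"],
             ["acceptance", "information", "technology"],
             ["media", "model", "selection"],
             ["acceptance", "information", "model"],
             ["acceptance", "technology"],
             ["media", "media"]]
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Custom local and global weighting; swap in whatever the paper used
tfidf = models.TfidfModel(
    corpus,
    wlocal=math.sqrt,                        # tf(t, d) = sqrt(raw frequency)
    wglobal=lambda df, D: math.log(D / df),  # idf with natural log instead of log_2
)
for doc in tfidf[corpus]:
    print(doc)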


Should the background dataset for shap be standardized?

So I am trying to explain a basic SVM model using SHAP. The inputs to the SVM model, however, are standardized: I used StandardScaler().fit() and then transformed the data points with the fitted StandardScaler so that they can be used by the SVM model.
My question is now when using SHAP I need to give it a background distribution. Usually the input to this background distribution looks like this:
background_distribution = KMeans(n_clusters=10,random_state=0).fit(xtrain).cluster_centers_
However, I wanted to use my own custom background distribution, which contains selected data points. Does this mean the data points need to be standardized as well? I.e., instead of looking like
[ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
they look like this
[ 0.67028006 -0.18887347 0.90860212 -0.41342579 0.26204266 0.55080012
-0.85479154 0.13743146 -0.70749448 -0.42919754 1.21628074 -0.71418983
-0.26726124 -0.52247913 -0.34755864 0.31234752 -0.23208655 -0.63565412
-0.40904178 0. 4.89897949 -0.23473314 0.64082627 -0.46852129
-0.26726124 -0.44542354 1.15657353 0.53795751]
For clarity: I am asking whether, after retrieving my points, I need to standardize the background data set, since my original data points are scaled for use in the model, whereas my background distribution contains unscaled data points.
The model training looks like this:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

ss = StandardScaler().fit(X)
xtrain = ss.transform(xtrain) #Changes values to make them ML compatible -not needed for trees
xtest = ss.transform(xtest)
support_vector_classifier = SVC(kernel='rbf')
support_vector_classifier.fit(xtrain,ytrain)
y_pred_svc = support_vector_classifier.predict(xtest)
Option A:
background_distribution= [ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
shap.KernelExplainer(support_vector_classifier.predict,background_distribution)
Option B:
background_distribution= [ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
shap.KernelExplainer(support_vector_classifier.predict,background_distribution)
Option B is close. Your background should be preprocessed in the same way as your training data.
This is the case in any ML situation where you preprocess data: whether you split your data into train, test, and validation sets, or feed your data to a trained model for prediction, you always apply the same transformations to all parts of your data, sometimes manually, sometimes through a pipeline. SHAP is no exception to this principle.
However, you may think about the following as well: your scaler should be fitted on the training data before being applied to test or background data. You can't fit it on test, validation, or background data, because that would be like asking to be shown the future before predicting it ("data leakage", as it's called in ML).
This means, you can't:
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
Rather:
ss = StandardScaler().fit(X_train)
background_distribution = ss.transform(background_distribution)
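Putting it together, a minimal sketch of the corrected flow. Variable names follow the question and answer; raw_background_points is a hypothetical array holding your unscaled custom background points:
import numpy as np
import shap
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Fit the scaler on the training data only
ss = StandardScaler().fit(X_train)
xtrain = ss.transform(X_train)

support_vector_classifier = SVC(kernel='rbf')
support_vector_classifier.fit(xtrain, ytrain)

# Transform the custom background with the SAME fitted scaler
background_distribution = ss.transform(np.asarray(raw_background_points))

explainer = shap.KernelExplainer(support_vector_classifier.predict,
                                 background_distribution)
shap_values = explainer.shap_values(ss.transform(xtest))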

Filling a matrix with a colour gradient pattern according to diamond geometry

Context
Suppose I have a square ABCD within which I have the geometry EFGH, where the lines EF and GH are on the sides AB and CD:
And I would like to add a colour pattern, a near-vertical beam resembling EFGH, as below:
Question
How could I ensure the center of the pattern matches the centerline, whilst the pattern gradient spread matches the spread of the geometry (from top to bottom in this case)?
MWE
The following Python code specifies the geometries and the resolution = the size of the matrix, where each pixel encodes a colour value.
import numpy as np
resolution = 400  # number of pixels per side
# Specify rectangle
A = (1, 6)
B = (6, 6)
C = (6, 1)
D = (1, 1)
# Specify geometry
E = (2, 6)
F = (3, 6)
G = (4, 1)
H = (5.5, 1)
gradient_matrix = np.zeros([resolution, resolution])
Expected Output
Ideally, the colour gradient matrix would look something like:
0 0 0.1 0.3 0.6 0.9 0   0   0   0   0   0
0 0 0.1 0.2 0.4 0.6 0.9 0   0   0   0   0
0 0 0   0.1 0.2 0.4 0.6 0.8 0.9 0   0   0
0 0 0   0.1 0.2 0.4 0.5 0.7 0.8 0.9 0   0
0 0 0   0   0.1 0.2 0.4 0.5 0.7 0.8 0.9 0
where the colour goes from green = 0, through light = 0.5, to red = 0.9, with a slowly widening spread from top to bottom.
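A minimal sketch of one way to do this: interpolate the beam's left edge from E to G and its right edge from F to H, row by row. Names follow the MWE; the 0-to-0.9 left-to-right ramp is an assumption based on the expected output:
import numpy as np

resolution = 400
A, B = (1, 6), (6, 6)
E, F = (2, 6), (3, 6)    # beam endpoints on the top side AB
G, H = (4, 1), (5.5, 1)  # beam endpoints on the bottom side DC

gradient_matrix = np.zeros([resolution, resolution])
xs = np.linspace(A[0], B[0], resolution)  # x-coordinate of every pixel column

for row in range(resolution):
    t = row / (resolution - 1)             # 0 at the top edge, 1 at the bottom
    x_left = E[0] + t * (G[0] - E[0])      # left beam edge at this height
    x_right = F[0] + t * (H[0] - F[0])     # right beam edge at this height
    inside = (xs >= x_left) & (xs <= x_right)
    # Ramp from 0 at the left edge of the beam to 0.9 at the right edge
    gradient_matrix[row, inside] = 0.9 * (xs[inside] - x_left) / (x_right - x_left)
The beam's centre automatically follows the midline between the interpolated edges, and the spread widens toward the bottom because GH is wider than EF.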

ALS algorithm in Dask optimization

I am trying to implement the ALS algorithm in Dask, but I am having trouble figuring out how to compute the latent features in one step. I followed the formulas in this stackoverflow thread and came up with this code:
Items = da.linalg.lstsq(da.add(da.dot(Users, Users.T), lambda_ * da.eye(n_factors)),
                        da.dot(Users, X))[0].T.compute()
Items = np.where(Items < 0, 0, Items)
Users = da.linalg.lstsq(da.add(da.dot(Items.T, Items), lambda_ * da.eye(n_factors)),
                        da.dot(Items.T, X.T))[0].compute()
Users = np.where(Users < 0, 0, Users)
But I don't think this works correctly, because MSE is not decreasing.
Example input:
n_factors = 2
lambda_ = 0.1
# We have 6 users and 4 items
The matrices X_train (6x4), R (4x6), Users (2x6) and Items (4x2) look like:
X_train:
1 0 0 0
0 0 0 0
0 3 0 0
0 3 0 0
1 1 1 0
1 0 0 0
R:
5 2 1 0 0 0
4 0 0 0 1 1
4 0 0 0 0 0
0 0 0 0 0 0
Users:
0.8 1.3 1.1 0.2 4.1 1.6
3.9 4.3 3.5 2.7 4.3 0.5
Items:
2.9 1.5
0.2 4.7
0.9 1.1
4.8 3.0
EDIT: I found the problem, but I don't know how to get around it. Before the iteration starts, I set all values in the X_train matrix where there is no rating to 0.
X_train = da.nan_to_num(X_train)
The reason for that is that the dot product only works on numeric values. But because the matrix is very sparse, 90% of it now consists of zeros, and instead of fitting the real ratings in the matrix, it fits these zeros.
Any help would be highly appreciated. <3
One way to handle gaps or missing values in data sets is to use masked arrays. As of May 2017, Dask supports them too.
Defining a masked array in Dask is fairly simple and similar to numpy's. All supported functions are listed in the docs; here are just some of the most commonly used approaches:
import dask.array as da

data_set = da.array([[1, 2], [3, 4]])
masked_data_set_1 = da.ma.masked_array(data_set, mask=[[False, True],[True, False]])
# returns [[1, --],[--, 4]]
masked_data_set_2 = da.ma.masked_equal(data_set, 4)
# returns [[1, 2],[3, --]]
masked_data_set_3 = da.ma.masked_where(data_set < 3, data_set)
# returns [[--, --],[3, 4]]
In your case, you are trying to perform the dot product da.dot(Users, X). Instead of setting all NaN values to 0, you can use a masked array:
masked_X = da.ma.masked_where(X != X, X)  # NaN != NaN, so this masks exactly the missing ratings
Now you can easily perform the dot product like:
da.ma.getdata(da.dot(Users, masked_X))
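Applied to the code from the question, a sketch of the Items update might look like the following (same variable names as above; whether the masked dot product alone fixes convergence still depends on the rest of your ALS loop):
import dask.array as da
import numpy as np

# Mask the missing ratings instead of zero-filling them
masked_X = da.ma.masked_where(X_train != X_train, X_train)

# Normal-equation style update for the item factors
lhs = da.add(da.dot(Users, Users.T), lambda_ * da.eye(n_factors))
rhs = da.ma.getdata(da.dot(Users, masked_X))
Items = da.linalg.lstsq(lhs, rhs)[0].T.compute()
Items = np.where(Items < 0, 0, Items)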

Assigning a label to its corresponding grid cell

I coded a YOLO model from scratch and have a numpy array which looks like this:
[
[[1 0 1 0.4 0.3 0.2 0.1]
[1 1 0 0.2 0.3 0.4 0.5]
[0 0 0 0 0 0 0]]
...]
This is how it would look in a pandas object:
Obj_score c1 c2 x y h w
1 0 1 0.4 0.3 0.2 0.1
1 1 0 0.2 0.3 0.4 0.5
0 0 0 0 0 0 0
In order to make my model work, I have to convert this label tensor into an S*S*(B*5+C) tensor, where each label is put into its corresponding grid cell. How would I do that?
My model makes 3 bounding box predictions (called B) and 2 class predictions (called C), and S = 7.
How would I put my labels into their corresponding grid cells (using numpy or keras)?
If it would help, there is some code in Vivek Maskara's solution to this issue, which can also be found in his article about implementing YOLO v1 from scratch.
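As a starting point, here is a minimal sketch of the usual cell-assignment step (my own illustration, not Maskara's code; it assumes x and y are box-centre coordinates normalised to [0, 1] over the whole image and fills only the first of the B box slots per cell):
import numpy as np

S, B, C = 7, 3, 2  # grid size, boxes per cell, number of classes

# One row per label: [obj_score, c1, c2, x, y, h, w]
labels = np.array([[1, 0, 1, 0.4, 0.3, 0.2, 0.1],
                   [1, 1, 0, 0.2, 0.3, 0.4, 0.5],
                   [0, 0, 0, 0, 0, 0, 0]])

target = np.zeros((S, S, B * 5 + C))

for obj_score, c1, c2, x, y, h, w in labels:
    if obj_score == 0:
        continue  # padding row, no object to assign
    col, row = int(x * S), int(y * S)          # grid cell containing the box centre
    x_cell, y_cell = x * S - col, y * S - row  # centre relative to that cell
    target[row, col, 0:5] = [1, x_cell, y_cell, h, w]  # first box slot
    target[row, col, B * 5:] = [c1, c2]        # class scores after the B box slots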

Pick random coordinates in Numpy array based on condition

I have used a 2-D convolution to generate some statistics on conditions of local patterns. To be complete, I'm working with images, and the value 0.5 is my "gray-screen"; I cannot use masks before this, unfortunately (dependence on some other packages). I want to add new objects to my image, but they should overlap at least 75% with non-gray-screen pixels. Let's assume the new object is square: I mask the image into gray-screen versus the rest, then do a 2-D convolution with an n-by-n matrix filled with 1s, so I get the number of gray-screen pixels in each patch. This all works, so I have a matrix of suitable places to put my new object. How do I efficiently pick a random one from this matrix?
Here is a small example with a 5x5 image and a 2x2 convolution matrix, where I want a random coordinate in my last matrix with a 1 (because there is at most one 0.5 in that patch).
Image:
1 0.5 0.5 0 1
0.5 0.5 0 1 1
0.5 0.5 1 1 0.5
0.5 1 0 0 1
1 1 0 0 1
Convolution matrix:
1 1
1 1
Convoluted image:
3 3 1 0
4 2 0 1
3 1 0 1
1 0 0 0
Conditioned on <= 1:
0 0 1 1
0 0 1 1
0 1 1 1
1 1 1 1
How do I get a uniformly distributed coordinate of the 1s efficiently?
np.where and np.random.randint should do the trick:
# we grab the indexes of the ones
x, y = np.where(convoluted_image <= 1)
# we choose one index randomly
i = np.random.randint(len(x))
random_pos = [x[i], y[i]]
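Equivalently, a compact variant using np.argwhere (same idea, just packaged as coordinate pairs):
import numpy as np

coords = np.argwhere(convoluted_image <= 1)  # shape (k, 2) array of (row, col) pairs
random_pos = coords[np.random.randint(len(coords))]
Since every qualifying index appears exactly once in coords, each 1-cell is picked with equal probability.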
