I want to visualize a topic model built with the LDA algorithm. I use the Python module pyLDAvis and work in a Jupyter notebook.
import pyLDAvis.sklearn
...
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)              # default, i.e. mds='pcoa'
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='mmds')
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='tsne')
It works fine, but I don't really understand the mds parameter, even after reading the documentation:
mds : function or a string representation of a function
A function that takes topic_term_dists as an input and outputs a n_topics by 2 distance matrix. The output approximates the distance between topics. See js_PCoA() for details on the default function. A string representation currently accepts pcoa (or upper case variant), mmds (or upper case variant) and tsne (or upper case variant), if sklearn package is installed for the latter two.
Does somebody know what the differences between mds='pcoa', mds='mmds', and mds='tsne' are?
Thanks!
Each option performs dimension reduction via Jensen-Shannon Divergence combined with:
pcoa: Principal Coordinate Analysis (aka Classical Multidimensional Scaling)
mmds: Metric Multidimensional Scaling
tsne: t-distributed Stochastic Neighbor Embedding
Simply put: text data, when transformed into numeric tabular data, is usually high-dimensional. On the other hand, a visualization on a screen is two-dimensional (2D). Thus, a dimension reduction method is required to bring the number of dimensions down to 2.
mds stands for multidimensional scaling. The possible values of that argument are:
mmds (Metric Multidimensional Scaling),
tsne (t-distributed Stochastic Neighbor Embedding), and
pcoa (Principal Coordinate Analysis).
All of them are dimension reduction methods.
Another method of dimension reduction that may be more familiar to you but not listed above is PCA (principal component analysis). They all share the similar idea of reducing dimensionality without losing too much information, backed by different theories and implementations.
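If it helps to see the idea in code, here is a rough sketch (not pyLDAvis's actual implementation) of what each option amounts to: compute Jensen-Shannon distances between made-up topic-term distributions, then project them to 2D with metric MDS and t-SNE. All data and parameter choices below are invented for illustration.

import numpy as np
from scipy.spatial.distance import jensenshannon, pdist, squareform
from sklearn.manifold import MDS, TSNE

# 10 made-up topics, each a probability distribution over 500 terms
rng = np.random.RandomState(0)
topic_term = rng.dirichlet(np.ones(500), size=10)

# pairwise Jensen-Shannon distances between topics
D = squareform(pdist(topic_term, metric=jensenshannon))

# metric MDS on the precomputed distances (roughly what mds='mmds' does)
xy_mmds = MDS(n_components=2, dissimilarity='precomputed',
              random_state=0).fit_transform(D)

# t-SNE on the precomputed distances (roughly what mds='tsne' does)
xy_tsne = TSNE(n_components=2, metric='precomputed', init='random',
               perplexity=5, random_state=0).fit_transform(D)

print(xy_mmds.shape, xy_tsne.shape)   # both (10, 2): one 2D point per topic

The methods simply make different trade-offs in how faithfully the 2D layout preserves the original inter-topic distances, which is why the topic circles can end up in noticeably different positions depending on the mds value.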
I have to implement an energy function, termed Rigidity Energy, as in Eq 7 of this paper here.
The energy function takes as input two 3D object meshes and returns the energy between them. The first mesh is the source mesh, and the second is a deformed version of the source mesh. In rough pseudo-code, the computation goes like this:
Iterate over all the vertices in the source mesh.
For every vertex, compute its covariance matrix with its neighboring vertices.
Perform SVD on the computed covariance matrix and find the rotation matrix of the vertex.
Use the computed rotation matrix, the point coordinates in the original mesh and the corresponding coordinates in the deformed mesh, to compute the energy deviation of the vertex.
Thus this energy function requires me to iterate over each point in the mesh, and the mesh could have more than 2k such points. In TensorFlow, there are two ways to do this. I can have two tensors of shape (N, 3), one holding the points of the source mesh and the other those of the deformed mesh, and then either:
Do it purely with TensorFlow tensors: iterate over the elements of the above tensors using tf.gather and perform the computation on each point using only existing TF operations. This method would be extremely slow. I've tried to define loss functions that iterate over thousands of points before, and the graph construction alone takes too much time to be practical.
Add a new TF op as explained in the TF documentation here. This involves writing the function in C++ (and CUDA, for GPU support) and registering the new op with TF.
The first method is easy to write but impractically slow. The second method is a pain to write.
I've used TF for 3 years, and have never used PyTorch before, but at this point I'm considering switching to it, if it offers a better alternative for such cases.
Does PyTorch offer a way to implement such loss functions that is both easy to write and runs fast on the GPU, i.e. a Pythonic way of writing my own loss functions that run on the GPU, without any C or CUDA code on my part?
As far as I understand, you are essentially asking whether this operation can be vectorized. The answer is no, at least not fully, because the svd implementation in PyTorch is not vectorized.
If you showed the TensorFlow implementation, it would help in understanding your starting point. I don't know what you mean by finding the rotation matrix of the vertex, but I would guess this can be vectorized. This would mean that svd is the only non-vectorized operation and you could perhaps get away with writing just a single custom OP, that is, the vectorized svd - which is likely quite easy, because it would amount to calling some library routines in a loop in C++.
Two possible sources of problems I see are:
if the neighborhoods N(i) in equation 7 can be of significantly different sizes (which would mean that the covariance matrices are of different sizes and vectorization would require some dirty tricks)
the general problem of dealing with meshes and neighborhoods could be difficult. This is an innate property of irregular meshes, but PyTorch has support for sparse matrices and a dedicated package, torch_geometric, which at least helps.
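If it helps, here is a minimal PyTorch sketch of what I mean by vectorizing everything except the svd. I'm assuming the neighborhoods can be padded to a fixed size K and passed as an (N, K) index tensor; the function name and the Kabsch-style rotation extraction are my own choices, not something from the paper.

import torch

def per_vertex_rotations(src, dst, nbr_idx):
    # src, dst: (N, 3) source / deformed vertex positions
    # nbr_idx:  (N, K) neighbor indices, padded to a fixed size K (my assumption)

    # vectorized part: edge vectors and per-vertex 3x3 cross-covariances
    src_e = src[nbr_idx] - src[:, None, :]            # (N, K, 3)
    dst_e = dst[nbr_idx] - dst[:, None, :]            # (N, K, 3)
    cov = src_e.transpose(1, 2) @ dst_e               # (N, 3, 3)

    # non-vectorized part: one SVD per vertex; this loop is the candidate
    # for a custom op (recent PyTorch versions can also batch this call)
    rotations = []
    for C in cov:
        U, _, Vh = torch.linalg.svd(C)
        V = Vh.T
        # closest rotation R = V diag(1, 1, sign(det(V U^T))) U^T (reflection fix)
        d = torch.sign(torch.det(V @ U.T))
        D = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
        rotations.append(V @ D @ U.T)
    return torch.stack(rotations)                     # (N, 3, 3)

# usage with made-up shapes: R = per_vertex_rotations(src_pts, deformed_pts, neighbor_idx)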
I am recreating a project that uses multidimensional scaling (MDS) to visualise the data in the final stage. Specifically, the original work uses MATLAB's mdscale with the 'metricsstress' criterion, which according to the documentation uses 'Squared stress, normalized with the sum of 4th powers of the dissimilarities'.
My preferred environment is Python, and the only implementation of MDS I'm aware of is sklearn.manifold.MDS, which uses SMACOF. There, stress is defined as the 'sum of squared distance of the disparities and the distances for all constrained points', but nothing is said about the normalisation.
My question is: were I to use the sklearn implementation in place of the mdscale one, would the results be comparable?
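For reference, the sklearn side I would be calling looks roughly like this (just a sketch; D is a made-up symmetric dissimilarity matrix standing in for my real data):

import numpy as np
from sklearn.manifold import MDS

# made-up symmetric dissimilarities with a zero diagonal
rng = np.random.RandomState(0)
D = rng.rand(10, 10)
D = (D + D.T) / 2
np.fill_diagonal(D, 0)

# metric MDS via SMACOF; its stress is a plain sum of squared differences,
# with no obvious counterpart to mdscale's 4th-power normalisation
mds = MDS(n_components=2, metric=True, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(D)
print(coords.shape)    # (10, 2)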
Software for vector quantization usually works only on numerical data. One example of this is Python's scipy.cluster.vq.vq (here), which performs vector quantization. The numerical data requirement also shows up for most clustering software.
Many have pointed out that you can always convert a categorical variable to a set of binary numeric variables. But this becomes awkward when working with big data where an individual categorical variable may have hundreds or thousands of categories.
The obvious alternative is to change the distance function. With mixed data types, the distance from an observation to a "center" or "codebook entry" could be expressed as a two-part sum involving (a) the usual Euclidean calculation for the numeric variables and (b) the sum of inequality indicators for categorical variables, as proposed here on page 125.
Is there any open-source software implementation of vector quantization with such a generalized distance function?
For machine learning and clustering algorithms you may also find scikit-learn useful. To achieve what you want, you can have a look at their implementation of DBSCAN.
In their documentation, you can find:
sklearn.cluster.dbscan(X, eps=0.5, min_samples=5, metric='minkowski', algorithm='auto', leaf_size=30, p=2, random_state=None)
Here X can be either your already-computed distance matrix (in which case you pass metric='precomputed') or the standard samples x features matrix, while metric can be a string (the identifier of one of the already-implemented distance functions) or a callable Python function that computes distances in a pairwise fashion.
If you can't find the metric you want, you can always program it as a Python function:
def mydist(a, b):
    # must return a single non-negative number, e.g. Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
And call dbscan with metric=mydist. Alternatively, you can compute your distance matrix beforehand and pass it to the clustering algorithm.
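For the mixed-type case from the question, a sketch of the two-part distance could look like this (assuming, purely for illustration, that the first NUM_COLS entries of each row are the numeric variables and the remaining entries are integer-coded categories):

import numpy as np
from sklearn.cluster import DBSCAN

NUM_COLS = 3   # hypothetical split: first 3 columns numeric, the rest category codes

def mixed_dist(a, b):
    # Euclidean part over the numeric columns
    num = np.sqrt(np.sum((a[:NUM_COLS] - b[:NUM_COLS]) ** 2))
    # simple-matching part over the categorical columns (count of mismatches)
    cat = np.sum(a[NUM_COLS:] != b[NUM_COLS:])
    return num + cat

# tiny made-up dataset: 3 numeric columns followed by 2 integer-coded categories
X = np.array([[0.1, 2.0, 3.0, 0, 1],
              [0.2, 2.1, 2.9, 0, 1],
              [5.0, 9.0, 1.0, 2, 3]])
labels = DBSCAN(eps=1.0, min_samples=1, metric=mixed_dist).fit_predict(X)
print(labels)   # the first two rows end up in the same cluster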
There are some other clustering algorithms in the same library; have a look at them here.
You cannot "quantize" categorial data.
Recall definitions of quantization (Wiktionary):
To limit the number of possible values of a quantity, or states of a system, by applying the rules of quantum mechanics
To approximate a continuously varying signal by one whose amplitude can only have a set of discrete values
In other words, quantization means converting a continuous variable into a discrete variable. Vector quantization does the same, for multiple variables at the same time.
However, categorical variables are already discrete.
What you seem to be looking for is a prototype-based clustering algorithm for categorical data (maybe STING and COOLCAT? I don't know if they will produce prototypes); but this isn't "vector quantization" anymore.
I believe that very often, frequent itemset mining is actually the best approach to find prototypes/archetypes of categorical data.
As for clustering algorithms that allow other distance functions - there are plenty. ELKI has a lot of such algorithms, and also a tutorial on implementing a custom distance. But this is Java, not Python. I'm pretty sure at least some of the clustering algorithms in scipy allow custom distances, too.
Now Python's scipy.cluster.vq.vq is really simple code. You do not need a library for that at all. The main job of this function is wrapping a C implementation which runs much faster than Python code; if you look at the py_vq version (which is used when the C version cannot be used), it is really simple code. Essentially, for every object obs[i] it calls this function:
code[i] = argmin(np.sum((obs[i] - code_book) ** 2, 1))
Now you obviously can't use Euclidean distance with a categorical codebook, but translating this line to whatever similarity you want is not hard.
The harder part usually is constructing the codebook, not using it.
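To make that concrete, here is a rough sketch of the same argmin assignment with a mismatch-count dissimilarity over integer-coded categories (the function name and shapes are my own, not scipy API):

import numpy as np

def categorical_vq(obs, code_book):
    # obs:       (n_obs, n_vars) integer category codes
    # code_book: (n_codes, n_vars) integer category codes
    # dissimilarity = number of attributes on which observation and code differ
    mismatches = (obs[:, None, :] != code_book[None, :, :]).sum(axis=2)
    code = mismatches.argmin(axis=1)                     # nearest codebook entry
    dist = mismatches[np.arange(len(obs)), code]         # its dissimilarity
    return code, dist

obs = np.array([[0, 1, 2],
                [1, 1, 0]])
book = np.array([[0, 1, 0],
                 [1, 1, 0]])
print(categorical_vq(obs, book))   # each row is assigned to its closest prototype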
I am studying some machine learning and have come across, in several places, the claim that Latent Semantic Indexing may be used for feature selection. Can someone please provide a brief, simplified explanation of how this is done? Ideally both theoretically and in commented code. How does it differ from Principal Component Analysis?
What language it is written in doesn't really worry me, just that I can understand both code and theory.
LSA is conceptually similar to PCA, but is used in different settings.
The goal of PCA is to transform data into a new, possibly lower-dimensional space. For example, if you wanted to recognize faces and used 640x480-pixel images (i.e. vectors in a 307200-dimensional space), you would probably try to reduce this space to something more manageable, both to make the computation simpler and to make the data less noisy. PCA does exactly this: it "rotates" the axes of your high-dimensional space and assigns a "weight" to each of the new axes, so that you can throw away the least important of them.
LSA, on the other hand, is used to analyze the semantic similarity of words. It can't handle images, bank data, or other custom datasets. It is designed specifically for text processing and works specifically with term-document matrices. Such matrices, however, are often considered too large, so they are reduced to lower-rank matrices in a way very similar to PCA (both of them use SVD). Feature selection, though, is not performed here. Instead, what you get is a feature vector transformation: SVD provides you with a transformation matrix (let's call it S) which, when multiplied by an input vector x, gives a new vector x' in a smaller space with a more informative basis.
This new basis is your new set of features. They are not selected, though, but rather obtained by transforming the old, larger basis.
For more details on LSA, as well as implementation tips, see this article.
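As a rough illustration of that transformation, here is a sketch using scikit-learn's TruncatedSVD, which is the usual way to compute LSA in Python; the tiny corpus is made up:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks",
]

# document-term matrix (sparse, high-dimensional)
X = TfidfVectorizer().fit_transform(docs)

# LSA: truncated SVD of that matrix, keeping only 2 latent dimensions
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X)          # (n_docs, 2) transformed feature vectors

print(X_lsa)   # the pet documents tend to land near each other, as do the finance ones

Note that the two output columns are not two of the original term features; each one is a weighted combination of all the old features.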
I wish to use scikit-learn's SVM with a chi-squared kernel, as shown here. In this scenario the kernel is applied to histograms, which is how my data is represented. However, I can't find an example of this being used with histograms. What is the proper way to do this?
Is the correct approach to just treat the histogram as a vector, where each element in the vector corresponds to a bin of the histogram?
Thank you in advance
There is an example of using an approximate feature map here. It is for the RBF kernel but it works just the same.
The example above uses a Pipeline, but you can also just apply the transform to your data before handing it to a linear classifier, as AdditiveChi2Sampler doesn't actually fit to the data in any way.
Keep in mind that this is just an approximation of the kernel map (one that I found to work quite well), and if you want to use the exact kernel you should go with ogrisel's answer.
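For histogram data that would look roughly like this (the data here is made up; each row stands in for one of your histograms):

import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.rand(100, 16)           # 100 made-up histograms with 16 bins each (non-negative)
y = rng.randint(0, 2, 100)      # made-up binary labels

# approximate chi2 feature map followed by a linear SVM
clf = make_pipeline(AdditiveChi2Sampler(sample_steps=2), LinearSVC())
clf.fit(X, y)
print(clf.score(X, y))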
sklearn.svm.SVC accepts custom kernels in two ways:
arbitrary Python functions passed as the kernel argument to the constructor
a precomputed kernel matrix passed as the first argument to fit, together with kernel='precomputed' in the constructor
The former can be much slower but does not require allocating the whole kernel matrix in advance (which can be prohibitive when n_samples is large).
There are more details and links to examples in the documentation on custom kernels.
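For the histogram case specifically, both options can use sklearn's built-in chi2_kernel; a sketch with made-up data:

import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(100, 16)           # made-up histograms, one per row
y = rng.randint(0, 2, 100)

# option 1: pass the exact chi2 kernel as a callable
clf = SVC(kernel=chi2_kernel).fit(X, y)

# option 2: precompute the kernel matrix and use kernel='precomputed'
K = chi2_kernel(X)                              # (n_samples, n_samples)
clf2 = SVC(kernel='precomputed').fit(K, y)

# with 'precomputed', prediction needs the kernel between test and training samples
K_test = chi2_kernel(X[:5], X)
print(clf.predict(X[:5]), clf2.predict(K_test))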