GMM clustering algorithm with equal weight and shared diagonal covariance - python

I'm looking for a Gaussian mixture model clustering algorithm that would allow me to set equal component weights and shared diagonal covariances. I need to analyze a set of data and I don't have the time to try to write the code myself.

In Python you can use scikit-learn's GMM. It's easy to do; see the docs:
http://scikit-learn.sourceforge.net/dev/modules/generated/sklearn.mixture.GMM.html
Re your specific needs:
thegmm = GMM(cvtype='tied', params='mc')
thegmm.fit(mydata)
Meaning:
shared diagonal covariances: use covariance_type='tied' in the constructor
equal component weights: use params='mc' in the constructor (rather than the default 'wmc' which lets weights update).
Actually, I'm not sure if 'tied' implies diagonal covariances. It looks like you can choose 'tied' or 'diagonal' but not both, according to the doc. Anyone confirm?
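For reference, a minimal sketch along the lines of the answer above, written against the old sklearn.mixture.GMM API (the keyword was cvtype in very early releases and covariance_type later; in current scikit-learn, GMM has been replaced by GaussianMixture, which has no params argument for freezing weights). Note that 'tied' means one full covariance matrix shared by all components, while 'diag' means a separate diagonal covariance per component; there is no built-in "shared diagonal" option, so 'diag' is the closest match here:
from sklearn.mixture import GMM  # old API; removed in recent scikit-learn versions

# mydata: (n_samples, n_features) array, as in the snippet above
thegmm = GMM(n_components=3,           # illustrative number of clusters
             covariance_type='diag',   # per-component diagonal covariances
             params='mc',              # update means and covariances only;
                                       # weights stay at their uniform initialization
             init_params='wmc')
thegmm.fit(mydata)
labels = thegmm.predict(mydata)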

Looks like the standard MATLAB GMM tool will work: set the 'CovType' option to 'diagonal' and the 'SharedCov' option to true.

Related

How to check for non-linear relationships in an easy way using Python?

I have a dataset in a pandas DataFrame. I built a function which returns a DataFrame that looks like this:
Feature_name_1 Feature_name_2 corr_coef p-value
ABC DCA 0.867327 0.02122
So it takes pairs of independent variables and returns their correlation coefficient.
Is there any easy way I can check for non-linear relationships in the same fashion?
In the case above I used scipy's Pearson correlation, but I cannot find how to check for non-linear relationships. I have found only more sophisticated methods, and I would like something as easy to implement as the above.
It will be enough if the method is easy to implement; it doesn't have to come from scipy or other specific packages.
Regress your dependent variables on your independent variables and examine the residuals. If your residuals show a pattern, there is likely a nonlinear relationship.
It may also be the case that your model is missing a cross term or could benefit from a transformation or something along those lines. I might be wrong, but I'm not aware of a cut-and-dried test for non-linearity.
A quick Google search returned this, which seems like it might be useful for you:
https://stattrek.com/regression/residual-analysis.aspx
Edit: Per the comment below, this is a very general method that helps verify the linear regression assumptions.
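To make that concrete, here is a minimal sketch of the residual check described above (assuming x and y are numeric arrays or pandas Series; the plotting is purely illustrative):
import matplotlib.pyplot as plt
from scipy import stats

def residual_check(x, y):
    # Fit a simple linear regression y ~ x.
    slope, intercept, r, p, se = stats.linregress(x, y)
    residuals = y - (intercept + slope * x)
    # For a linear relationship this scatter should look like unstructured noise;
    # curvature or a funnel shape hints at non-linearity or a missing transformation.
    plt.scatter(x, residuals)
    plt.axhline(0, color='grey')
    plt.xlabel('x')
    plt.ylabel('residual')
    plt.show()
    return residuals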

Matching Properties on Heterogeneous Data Using Deep Learning

The issue I face is that I want to match properties (houses/apartments etc.) that are similar to each other (e.g. longitude and latitude (numerical), bedrooms (numerical), district (categorical), condition (categorical), etc.) using deep learning. The data is heterogeneous because we mix numerical and categorical data, and the problem is unsupervised because we don't use any labels.
My goal is to get a measure for how similar properties are so I can find the top matches for each target property. I could use KNN, but I want to use something that allows me to find embeddings and that uses deep learning.
I suppose I could use a mixed distance measure such as the Gower distance as the loss function, but how would I go about setting up a model that determines, say, the top 10 matches for each target property in my sample?
Any help or points to similar problem sets (Kaggle, notebooks, github) would be very appreciated.
Thanks
Given that you want an unsupervised approach, you could try using an auto-encoder. I have found Variational Auto-Encoders (VAEs) to be pretty good for other problems. The learned embedding should respect distance in the input space to some extent, but you might need to modify the loss function slightly if you want examples to be separated in a specific way.
To get the top k, you can just encode each example, compute a distance matrix and take the top k in each row (or col).
I have an implementation of VAEs (and others) in PyTorch here for your reference; obviously you will need a different network architecture to handle the categorical aspects, etc.
Hope this helps!
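For illustration, a rough PyTorch sketch of the encode-then-rank idea (this is not the linked VAE implementation; it assumes the mixed features have already been converted to a numeric matrix, e.g. one-hot encoded categoricals plus scaled numericals, and all sizes are made up):
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features, emb_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, emb_dim))
        self.decoder = nn.Sequential(nn.Linear(emb_dim, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

x = torch.randn(500, 20)                     # placeholder (n_properties, n_features) data
model = AutoEncoder(n_features=x.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):                         # plain reconstruction training
    recon, _ = model(x)
    loss = nn.functional.mse_loss(recon, x)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Top-10 matches per property: embed everything, build a pairwise distance
# matrix, and take the 10 smallest distances in each row (skipping self).
with torch.no_grad():
    _, z = model(x)
dist = torch.cdist(z, z)                     # (n, n) pairwise Euclidean distances
top10 = dist.topk(k=11, largest=False).indices[:, 1:]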

Find the appropriate polynomial fit for data in Python

Is there a function or library in Python to automatically compute the best polynomial fit for a set of data points? I am not really interested in the ML use case of generalizing to a set of new data; I am just focusing on the data I have. I realize the higher the degree, the better the fit. However, I want something that penalizes complexity, or that looks at where the error elbows. When I say elbowing, I mean an error curve that drops steeply at first and then flattens out (although usually it is not so drastic or obvious).
One idea I had was to use Numpy's polyfit: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.polyfit.html to compute polynomial regression for a range of orders/degrees. Polyfit requires the user to specify the degree of the polynomial, which poses a challenge because I don't have any assumptions or preconceived notions. The higher the degree of fit, the lower the error will be, but eventually it plateaus. Therefore I want to automatically compute the degree at which the error curve elbows: if my error is E and d is my degree, I want to maximize the second difference (E[d-1] - E[d]) - (E[d] - E[d+1]), i.e. pick the degree where the improvement from adding one more term drops off.
Is this even a valid approach? Are there other tools and approaches in well-established Python libraries like NumPy or SciPy that can help with finding the appropriate polynomial fit (without me having to specify the order/degree)? I would appreciate any thoughts or suggestions! Thanks!
To select the "right" fit and prevent over-fitting, you can use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). Note that your fitting procedure can be non-Bayesian and you can still use these to compare fits. Here is a quick comparison between the two methods.
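As a rough sketch of how that could look on top of numpy.polyfit (using the Gaussian least-squares form of the criteria and dropping constants that don't affect the comparison; max_degree is an arbitrary cap):
import numpy as np

def best_poly_degree(x, y, max_degree=10, criterion='bic'):
    n = len(x)
    scores = {}
    for d in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, d)
        rss = np.sum((y - np.polyval(coeffs, x)) ** 2)   # residual sum of squares
        k = d + 1                                        # number of fitted parameters
        if criterion == 'aic':
            scores[d] = n * np.log(rss / n) + 2 * k
        else:                                            # BIC penalizes complexity harder
            scores[d] = n * np.log(rss / n) + k * np.log(n)
    best = min(scores, key=scores.get)                   # lower is better for both criteria
    return best, scores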

What does the parameter "mds" mean in the pyLDAvis.sklearn.prepare() function?

I want to visualize the topic model produced by the LDA algorithm. I use the Python module called "pyLDAvis" in a Jupyter notebook environment.
import pyLDAvis.sklearn
...
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='mmds')
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='tsne')
It works fine, but I don't really understand the mds parameter, even after reading the documentation:
mds :function or a string representation of function
A function that takes topic_term_dists as an input and outputs a n_topics by 2 distance matrix. The output approximates the distance between topics. See js_PCoA() for details on the default function. A string representation currently accepts pcoa (or upper case variant), mmds (or upper case variant) and tsne (or upper case variant), if sklearn package is installed for the latter two.
Does somebody know the differences between mds='pcoa', mds='mmds', and mds='tsne'?
Thanks!
Dimension reduction via Jensen-Shannon Divergence & one of:
pcoa: Principal Coordinate Analysis (aka Classical Multidimensional Scaling)
mmds: Metric Multidimensional Scaling
tsne: t-distributed Stochastic Neighbor Embedding
Simply put: text data, when transformed into numeric tabular data, is usually high-dimensional. On the other hand, visualizations on a screen are two-dimensional (2D). Thus, a method of dimension reduction is required to bring the number of dimensions down to 2.
mds stands for multidimensional scaling. The possible values of that argument are:
mmds (Metric Multidimensional Scaling),
tsne (t-distributed Stochastic Neighbor Embedding), and
pcoa (Principal Coordinate Analysis).
All of them are dimension reduction methods.
Another method of dimension reduction that may be more familiar to you, but not listed above, is PCA (principal component analysis). They all share the same basic idea of reducing dimensionality without losing too much information, backed by different theories and implementations.
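Note also that, per the docstring quoted above, mds can be a callable instead of a string. A hypothetical example would be a plain PCA projection of the topic-term distributions (untested sketch; the function just has to return an (n_topics, 2) array):
from sklearn.decomposition import PCA

def pca_coords(topic_term_dists):
    # Project the topic-term distributions down to 2 coordinates per topic.
    return PCA(n_components=2).fit_transform(topic_term_dists)

pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds=pca_coords)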

How to use scikit-learn's SVM with histograms as features?

I wish to use scikit-learn's SVM with a chi-squared kernel, as shown here. In this scenario, the kernel is on histograms, which is how my data is represented. However, I can't find an example of these used with histograms. What is the proper way to do this?
Is the correct approach to just treat the histogram as a vector, where each element in the vector corresponds to a bin of the histogram?
Thank you in advance
There is an example of using an approximate feature map here. It is for the RBF kernel, but it works just the same.
The example above uses a pipeline, but you can also just apply the transform to your data before handing it to a linear classifier, as AdditiveChi2Sampler doesn't actually fit to the data in any way.
Keep in mind that this is just an approximation of the kernel map (which I found to work quite well), and if you want to use the exact kernel, you should go with ogrisel's answer.
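A minimal sketch of that approximate-map route, with placeholder data (each row of X is treated as a plain non-negative feature vector, i.e. one histogram per sample):
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.svm import LinearSVC

X = np.random.rand(100, 16)           # placeholder histograms (one row per sample)
y = np.random.randint(0, 2, 100)      # placeholder labels

chi2_map = AdditiveChi2Sampler(sample_steps=2)
X_mapped = chi2_map.fit_transform(X)  # stateless transform of the histogram features
clf = LinearSVC().fit(X_mapped, y)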
sklearn.svm.SVC accepts custom kernels in 2 ways:
an arbitrary Python function passed as the kernel argument to the constructor
a precomputed kernel matrix passed as the first argument to fit, with kernel='precomputed' in the constructor
The former can be much slower but does not require allocating the whole kernel matrix in advance (which can be prohibitive for large n_samples).
There are more details and links to examples in the documentation on custom kernels.
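And a sketch of the exact-kernel route using scikit-learn's built-in chi-squared kernel, again with placeholder histogram data:
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

X = np.random.rand(100, 16)          # placeholder histograms (one row per sample)
y = np.random.randint(0, 2, 100)     # placeholder labels

# Option 1: precompute the Gram matrix and tell SVC it is precomputed.
K = chi2_kernel(X, X)                # (n_samples, n_samples) kernel matrix
svm = SVC(kernel='precomputed').fit(K, y)

# Option 2: pass the kernel function directly and let SVC call it.
svm2 = SVC(kernel=chi2_kernel).fit(X, y)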
