I need to fit a data set, which I suspect should be described by the convolution of a chi-squared and a normal distribution, although the specific distributions are not the relevant matter here. I found this thread, where the accepted answer manages to convolve the two functions, but I haven't managed to turn that into something I can fit with. Is there a way of using a convolution of two continuous distributions for fitting in Python? I added a plot of the data, and the data can be found here.
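For context, a hedged sketch of one possible approach, not taken from the linked thread: numerically convolve the two PDFs with scipy.integrate.quad and fit the result to a normalized histogram with scipy.optimize.curve_fit. The chi-squared/normal parameters and the synthetic sample below are placeholders.

```python
import numpy as np
from scipy import stats, integrate, optimize

def conv_pdf(x, df, mu, sigma):
    """PDF of chi2(df) + Normal(mu, sigma), computed by numerical convolution."""
    integrand = lambda t: stats.chi2.pdf(t, df) * stats.norm.pdf(x - t, mu, sigma)
    value, _ = integrate.quad(integrand, 0, np.inf)
    return value

conv_pdf_vec = np.vectorize(conv_pdf)

# Synthetic stand-in for the real data: a sample from the assumed model, histogrammed.
sample = stats.chi2.rvs(4, size=5000) + stats.norm.rvs(2.0, 1.5, size=5000)
counts, edges = np.histogram(sample, bins=60, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Fit the convolved PDF to the normalized histogram (bounds keep df and sigma positive).
popt, pcov = optimize.curve_fit(conv_pdf_vec, centers, counts,
                                p0=[3.0, 1.0, 1.0],
                                bounds=([0.1, -np.inf, 0.01], [np.inf, np.inf, np.inf]))
print("fitted df, mu, sigma:", popt)
```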
Is there an easy(ish) way to fit a two-phase Coxian distribution, preferably in R or, if necessary, Python? This is a distribution with two transient states in sequence, each described by an exponential distribution, and each of which can lead to the absorbing state with some probability. I have some real-world data that I think is best described by this distribution, and I would like to estimate the exponential parameters of the two phases, ideally as a linear function of some covariates I have. If there is a package, library, or any other resource for fitting a model like this, I would really appreciate it. Thank you for your time.
What happens after I create a dimensionality reduction algorithm (PCA) that has produced a matrix W?
How do I now use it to make predictions on real-time data?
Do I need to create a user interface, or what?
If that's the case, how and where?
I'm completely lost.
It depends on what you wanted to achieve by doing PCA. One use case is clustering. If you look at the summary of your PCA model, you'll see the list of principal components and the proportion of variance explained by each one. You can choose the components that explain most of the variance (a common rule of thumb is that the cumulative proportion of variance explained should be > 80%). You can also plot a scree plot and look for a break (elbow) in the graph to determine the number of components to use for clustering. Feed those PCs into your clustering model and you may get good clustering results.
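A hedged sketch of that workflow (placeholder data; the 80% threshold comes from the rule of thumb above, while the choice of 3 clusters is arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(200, 10)                     # placeholder for your feature matrix

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.argmax(cum_var >= 0.80)) + 1    # smallest number of PCs explaining >= 80% of variance
print("cumulative variance:", cum_var, "-> keeping", n_keep, "components")

X_reduced = PCA(n_components=n_keep).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_reduced)   # 3 clusters chosen arbitrarily here
```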
For your second query, follow this link.
You don't need to create any user interface right now. Just follow the link above and you should get a good understanding of PCA and of how to use it for predictive modeling.
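As a hedged sketch of that "predict on new data" step (placeholder data; the logistic regression is just an example classifier, and the key point is reusing the same fitted PCA transform):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X_train = np.random.rand(100, 10)           # placeholder training features
y_train = np.random.randint(0, 2, 100)      # placeholder training labels

pca = PCA(n_components=3).fit(X_train)      # the fitted components_ are essentially your matrix W
clf = LogisticRegression().fit(pca.transform(X_train), y_train)

X_new = np.random.rand(5, 10)               # incoming "real-time" rows: same columns as the training data
y_pred = clf.predict(pca.transform(X_new))  # project with the SAME fitted PCA, then predict
```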
Missing values are a common problem in data analysis. One common strategy seems to be that missing values are replaced by values randomly sampled from the distribution of existing values.
Is there Python library code that conveniently performs this preprocessing step on a data frame? As far as I can see, the sklearn.preprocessing module does not offer this strategy.
To sample from a distribution of existing values you need to know the distribution. If the distribution is not known you can use kernel density estimation to fit it. This blog post has a nice overview of kernel density estimation implementations for Python: http://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/.
There is an implementation in scikit-learn (see http://scikit-learn.org/stable/modules/density.html#kernel-density); sklearn's KernelDensity has a .sample() method. There is also a kernel density estimator in statsmodels (http://statsmodels.sourceforge.net/devel/generated/statsmodels.nonparametric.kernel_density.KDEMultivariate.html); it supports categorical features.
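A minimal sketch of imputing with values sampled from a fitted KDE, using sklearn's KernelDensity and its .sample() method (the toy column and the bandwidth of 0.5 are my own assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 2.5, np.nan, 3.0]})   # toy column with missing values

observed = df.loc[df["x"].notna(), ["x"]].values                 # 2-D array of known values
kde = KernelDensity(bandwidth=0.5).fit(observed)                 # bandwidth would need tuning in practice

missing = df["x"].isna()
df.loc[missing, "x"] = kde.sample(missing.sum()).ravel()         # draw replacements from the fitted density
```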
Another method is to choose random existing values, without trying to generate values not seen in the dataset. The problem with this approach is that a value could depend on other values in the same row, and sampling without taking this into account may produce unrealistic examples.
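A sketch of that simpler per-column resampling (same toy column; note that it ignores dependencies between columns, which is exactly the caveat above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 2.5, np.nan, 3.0]})   # toy column with missing values

observed = df["x"].dropna().values
missing = df["x"].isna()
df.loc[missing, "x"] = np.random.choice(observed, size=missing.sum(), replace=True)
```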
I have images that I am segmenting using a gaussian mixture model from scikit-learn. Some images are labeled, so I have a good bit of prior information that I would like to use. I would like to run a semi-supervised training of a mixture model, by providing some of the cluster assignments ahead of time.
From the Matlab documentation, I can see that Matlab allows initial values to be set. Are there any python libraries, especially scikit-learn approaches that would allow this?
The standard GMM does not work in a semi-supervised fashion. The initial values you mentioned are likely the initial values for the mean vectors and covariance matrices of the Gaussians, which will then be updated by the EM algorithm.
A simple hack is to group your labeled data by label, estimate a mean vector and covariance matrix for each group individually, and pass these as the initial values to your MATLAB function (scikit-learn does not allow this as far as I'm aware). Hopefully this will position your Gaussians at the "correct" locations, and the EM algorithm will then take it from there to adjust these parameters.
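For what it's worth, if your scikit-learn version ships sklearn.mixture.GaussianMixture, it does accept means_init (as well as weights_init and precisions_init), so the same idea can be sketched roughly as follows; the data here is a synthetic placeholder:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder data: X_lab / y_lab are the labeled samples, X_all is everything to be clustered.
X_lab = np.random.rand(60, 2)
y_lab = np.repeat([0, 1, 2], 20)
X_all = np.vstack([X_lab, np.random.rand(200, 2)])

k = 3
means_init = np.array([X_lab[y_lab == c].mean(axis=0) for c in range(k)])

# Per-class covariances could be passed similarly via precisions_init (as inverse covariances).
gmm = GaussianMixture(n_components=k, means_init=means_init)
labels = gmm.fit_predict(X_all)
```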
The downside of this hack is that it does not guarantee the final clustering will respect your true label assignments: even if a data point was given a particular cluster label, there is a chance it will be re-assigned to another cluster. Noise in your feature vectors or labels could also cause your initial Gaussians to cover a much larger region than they are supposed to, wreaking havoc on the EM algorithm. And if you do not have enough data points for a particular cluster, the estimated covariance matrix might be singular, breaking this trick altogether.
Unless you really must use a GMM to cluster your data (e.g., you know for sure that Gaussians model your data well), perhaps you can just try the semi-supervised methods in scikit-learn. These propagate the labels to your other data points based on feature similarity. However, I doubt this can handle a large dataset, since it requires the graph Laplacian matrix to be built from pairs of samples, unless there is some special implementation trick to handle this in scikit-learn.
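A minimal sketch of those semi-supervised methods, here LabelSpreading (synthetic placeholder features; -1 marks unlabeled samples, and the kernel and gamma values are assumptions you would need to tune):

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X = np.random.rand(300, 5)                  # placeholder feature vectors (e.g. per-pixel features)
y = -np.ones(300, dtype=int)                # -1 = unlabeled sample
y[:30] = np.random.randint(0, 3, 30)        # the few labels you already have

model = LabelSpreading(kernel="rbf", gamma=20).fit(X, y)
propagated = model.transduction_            # propagated labels for every sample
```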
I would like some suggestions on the best clustering technique to use, with Python and scikits.learn. Our data come from a Phenotype Microarray, which measures the metabolic activity of a cell on various substrates over time. The output is a series of sigmoid curves, from which we extract curve parameters by fitting a sigmoid function.
We would like to "rank" these activity curves through clustering, using a fixed number of clusters. For now we are using the k-means algorithm provided by the package, with (init='random', k=10, n_init=100, max_iter=1000). The input is a matrix with n_samples rows and 5 parameters per sample. The number of samples can vary, but it is usually several thousand (e.g. 5,000). The clustering seems efficient and effective, but I would appreciate any suggestions on different methods or on the best way to assess the clustering quality.
Here are a couple of diagrams that may help:
the scatterplot of the input parameters (some of them are quite correlated), where each sample is colored by its assigned cluster;
the sigmoid curves from which the input parameters were extracted, also colored by their assigned cluster.
EDIT
Below are some elbow plots and the silhouette score for each number of clusters.
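For reference, a hedged sketch of how such elbow and silhouette curves can be computed with scikit-learn (placeholder data, and a smaller n_init than in the question to keep the loop fast):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(5000, 5)    # placeholder for the n_samples x 5 parameter matrix

inertia, silhouette = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, init="random", n_init=10, max_iter=1000).fit(X)
    inertia[k] = km.inertia_                          # within-cluster sum of squares, for the elbow plot
    silhouette[k] = silhouette_score(X, km.labels_)   # higher is better
print(inertia, silhouette)
```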
Have you noticed the striped pattern in your plots?
This indicates that you didn't normalize your data well enough.
"Area" and "Height" are highly correlated and probably on the largest scale. All the clustering happened on this axis.
You absolutely must:
perform careful preprocessing (see the sketch after this list)
check that your distance functions produce a meaningful (to you, not just the computer) notion of similarity
reality-check your results, and check that they aren't too simple, determined e.g. by a single attribute
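As promised above, a minimal sketch of the preprocessing point (placeholder data with deliberately mismatched scales; StandardScaler and 10 clusters are my assumptions, not the answer's prescription):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Placeholder data whose columns live on very different scales, like "Area" vs. "Height".
X = np.random.rand(5000, 5) * np.array([1.0, 1.0, 100.0, 1000.0, 1.0])

X_scaled = StandardScaler().fit_transform(X)                    # zero mean, unit variance per column
labels = KMeans(n_clusters=10, n_init=10).fit_predict(X_scaled)
```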
Don't blindly follow the numbers. K-means will happily produce k clusters no matter what data you give it; it just optimizes some number. It's up to you to check that the results are useful and to analyze their semantic meaning: it might well be that the result is mathematically a local optimum but meaningless for your task.
For 5000 samples, all methods should work without problem.
There is a pretty good overview here.
One thing to consider is whether you want to fix the number of clusters or not.
See the table for possible choices of the clustering algorithm depending on that.
I think spectral clustering is a pretty good method. You can use it for example together with the RBF kernel. You have to adjust gamma, though, and possibly restrict connectivity.
Ward and DBSCAN are also solid choices, and they don't require fixing n_clusters up front (hierarchical clustering builds a full tree you can cut later; DBSCAN determines the number of clusters itself).
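A hedged sketch of those options side by side (placeholder data; gamma, eps and min_samples are arbitrary and need tuning, and note that sklearn's Ward implementation still takes n_clusters, or in newer versions a distance_threshold):

```python
import numpy as np
from sklearn.cluster import SpectralClustering, AgglomerativeClustering, DBSCAN

X = np.random.rand(500, 5)    # placeholder for the parameter matrix

spectral = SpectralClustering(n_clusters=10, affinity="rbf", gamma=1.0).fit_predict(X)
ward = AgglomerativeClustering(n_clusters=10, linkage="ward").fit_predict(X)
dbscan = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # labels of -1 mark noise points
```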
You can also consult this chart (it reflects my personal opinion; I can't find the link to it in the scikit-learn docs)...
For judging the result: if you have no ground truth of any kind (which I imagine you don't, if this is exploratory), there is no good measure [yet] in scikit-learn.
There is one unsupervised measure, the silhouette score, but as far as I know it favours the very compact clusters found by k-means.
There are stability measures for clusters which might help, though they are not implemented in sklearn yet.
My best bet would be to find a good way to inspect the data and visualize the clustering.
Have you tried PCA and thought about manifold learning techniques?
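A hedged sketch of that kind of inspection (placeholder data; PCA to 2D just for plotting, with sklearn.manifold.TSNE as a manifold-learning alternative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(5000, 5)                          # placeholder parameter matrix
labels = KMeans(n_clusters=10, n_init=10).fit_predict(X)

X_2d = PCA(n_components=2).fit_transform(X)          # or sklearn.manifold.TSNE(n_components=2)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=5)   # color points by their assigned cluster
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```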