Missing values are a common problem in data analysis. One common strategy is to replace missing values with values randomly sampled from the distribution of the existing values.
Is there Python library code that conveniently performs this preprocessing step on a data frame? As far as I can see, the sklearn.preprocessing module does not offer this strategy.
To sample from the distribution of the existing values you need to know that distribution. If it is not known, you can fit it with kernel density estimation. This blog post gives a nice overview of kernel density estimation implementations for Python: http://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/.
There is an implementation in scikit-learn (see http://scikit-learn.org/stable/modules/density.html#kernel-density); sklearn's KernelDensity has a .sample() method. There is also a kernel density estimator in statsmodels (http://statsmodels.sourceforge.net/devel/generated/statsmodels.nonparametric.kernel_density.KDEMultivariate.html); it supports categorical features.
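For a single numeric column, a minimal sketch of the KernelDensity route could look like the function below; the Gaussian kernel and the bandwidth value are assumptions you would want to tune (e.g. by cross-validation).

import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity

def impute_with_kde(series, bandwidth=0.5, random_state=0):
    """Fill NaNs in a numeric Series with draws from a KDE fitted
    to the observed values. The bandwidth is a hand-picked assumption."""
    observed = series.dropna().to_numpy().reshape(-1, 1)
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(observed)
    n_missing = series.isna().sum()
    draws = kde.sample(n_samples=n_missing, random_state=random_state).ravel()
    filled = series.copy()
    filled.loc[filled.isna()] = draws
    return filled

# Example (hypothetical DataFrame df with a numeric column "x"):
# df["x"] = impute_with_kde(df["x"])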
Another method is to draw random existing values, without trying to generate values not seen in the dataset; a sketch of this is shown below. The problem with this solution is that a value may depend on other values in the same row, and random sampling that does not take this into account can produce unrealistic examples.
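For completeness, a sketch of that simpler resampling approach, column by column and ignoring any dependence between columns:

import numpy as np
import pandas as pd

def impute_by_resampling(df, columns=None, random_state=0):
    """Replace NaNs in each column with values drawn (with replacement)
    from that column's observed values. Ignores cross-column dependence."""
    rng = np.random.default_rng(random_state)
    out = df.copy()
    for col in (columns or df.columns):
        mask = out[col].isna()
        if mask.any():
            observed = out[col].dropna().to_numpy()
            out.loc[mask, col] = rng.choice(observed, size=mask.sum())
    return out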
Related
I need to fit a data set which I suspect should be described by the convolution of a chi2 and a normal distribution, though the specific distributions are not really the point. I found this thread, where the accepted answer manages to convolve the two functions, but I haven't managed to adapt that for fitting. Is there a way of using a convolution of two continuous distributions for fitting in Python? I added a plot of the data, and the data can be found here.
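One possible sketch (not an accepted answer): build the convolved density by numerical integration and fit its parameters to a histogram of the data with scipy.optimize.curve_fit. The arrays bin_centers and density below are hypothetical placeholders for your data.

import numpy as np
from scipy import stats, integrate, optimize

def chi2_normal_conv_pdf(x, df, mu, sigma, amplitude):
    """PDF of chi2(df) convolved with Normal(mu, sigma), evaluated by
    numerical integration. `amplitude` lets the curve match an
    unnormalised histogram if needed."""
    def one_point(xi):
        integrand = lambda t: stats.chi2.pdf(t, df) * stats.norm.pdf(xi - t, loc=mu, scale=sigma)
        val, _ = integrate.quad(integrand, 0, np.inf)
        return val
    return amplitude * np.array([one_point(xi) for xi in np.atleast_1d(x)])

# Hypothetical histogram of the data:
# density, edges = np.histogram(data, bins=50, density=True)
# bin_centers = 0.5 * (edges[1:] + edges[:-1])
# popt, pcov = optimize.curve_fit(chi2_normal_conv_pdf, bin_centers, density,
#                                 p0=[3.0, 0.0, 1.0, 1.0])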
I have a dataset at my disposal consisting of around 500 columns, which I need to explore in order to keep only the relevant ones. Pandas' info(verbose=True) method does not even display this many columns properly. I also used the missingno library to visualise nulls, but it uses a lot of RAM. What can I use instead of matplotlib here?
How do you approach datasets with a lot of features (more than 100)? Is there a useful workflow for eliminating useless features? How should I use info(), or is there an alternative?
Yes, I have also used the display options below to view everything. The question is how to set them locally.
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
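If by "locally" you mean only for a particular block of code, pandas' option_context context manager scopes the settings to a with block. A minimal sketch (df stands for your DataFrame):

import pandas as pd

# Settings apply only inside the `with` block and are restored afterwards.
with pd.option_context('display.max_rows', 500,
                       'display.max_columns', 500,
                       'display.width', 1000):
    print(df.describe())  # df: your DataFrame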
UPDATE:
I am interested in methods or solutions for exploring the initial raw data, for instance a one-cell script that summarises numerical features as distributions, categorical features as counts, and possibly more. I could write this myself, but maybe there is a library or a function of yours that already does this?
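For reference, a minimal one-cell summary along those lines could be something like the function below; libraries such as pandas-profiling do this far more thoroughly.

import pandas as pd

def quick_summary(df):
    """Summarise numeric columns as distributions and categorical
    columns as value counts (top values only)."""
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")

    print("=== Numeric features ===")
    print(numeric.describe().T)  # count, mean, std, quartiles per column

    print("\n=== Categorical features ===")
    for col in categorical.columns:
        print(f"\n{col} (nunique={df[col].nunique()}, missing={df[col].isna().sum()})")
        print(df[col].value_counts().head(10))

    print("\n=== Missing values per column ===")
    print(df.isna().sum().sort_values(ascending=False).head(20))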
Regarding the issue of useless features, you can easily estimate some metric of feature effectiveness and filter features out using a threshold. Check out the sklearn feature selection docs.
Of course, before doing that you'll have to make sure the features are numeric and that their representation suits the tests of your choice. For that I suggest you check out the sklearn pipelines (optional) and preprocessing docs.
Before estimating feature usefulness, make sure you cover missing data handling, encoding categorical variables and feature scaling.
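A rough sketch of that preprocessing plus a univariate filter; the column lists, the target y, and k=50 are all assumptions about your data.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

# numeric_cols / categorical_cols / X / y are placeholders for your data
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

selector = Pipeline([
    ("prep", preprocess),
    ("select", SelectKBest(score_func=f_classif, k=50)),  # keep the 50 best-scoring features
])

# X_reduced = selector.fit_transform(X, y)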
You can use XGBoost's feature_importances_ attribute. You first need to train a model with XGBoost, then use the importances to keep only the important features (by setting a threshold of your choice).
Dimensionality reduction can also come in handy, using PCA or some other algorithm.
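A sketch of that thresholding, here done through scikit-learn's SelectFromModel rather than a manual loop; X and y stand for your already-encoded features and target, and the median threshold is an arbitrary choice.

from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel

# X, y are placeholders for your numeric, encoded features and target
model = XGBClassifier(n_estimators=200, random_state=0)
model.fit(X, y)
importances = dict(zip(X.columns, model.feature_importances_))  # inspect directly

# Or let SelectFromModel apply a threshold and return the reduced matrix
selector = SelectFromModel(XGBClassifier(n_estimators=200, random_state=0),
                           threshold="median")
X_selected = selector.fit_transform(X, y)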
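For example, a minimal PCA sketch that keeps enough components to explain roughly 95% of the variance (an arbitrary cut-off), with scaling first since PCA is sensitive to feature variance:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
# X_reduced = pca_pipeline.fit_transform(X)   # X: numeric feature matrix (placeholder)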
I have a data set with a dozen dimensions (columns) and about 200 observations (rows). This dataset has been normalized using quantile_transform_normalize. (Edit: I tried running the clustering without normalization, but still no luck, so I don't believe this is the cause.) Now I want to cluster the data into several clusters. Until now I had been using KMeans, but I have read that it may not be accurate in higher dimensions and doesn't handle outliers well, so I wanted to compare to DBSCAN to see if I get a different result.
However, when I try to cluster the data with DBSCAN using the Mahalanobis distance metric, every item is clustered into -1. According to the documentation:
Noisy samples are given the label -1.
I'm not really sure what this means, but I was getting some OK clusters with KMeans so I know there is something there to cluster -- it's not just random.
Here is the code I am using for clustering:
import numpy as np
import sklearn.cluster

# Mahalanobis distance needs the covariance matrix of the data
covariance = np.cov(data.values.astype("float32"), rowvar=False)
clusterer = sklearn.cluster.DBSCAN(min_samples=6, metric="mahalanobis", metric_params={"V": covariance})
clusterer.fit(data)
And that's all. I know for certain that data is a numeric Pandas DataFrame as I have inspected it in the debugger.
What could be causing this issue?
You also need to choose the parameter eps.
DBSCAN results depend very much on this parameter. You can find some methods for estimating it in the literature.
IMHO, sklearn should not provide a default for this parameter, because it rarely ever works (on normalized toy data it is usually okay, but that's about it).
200 instances are probably too few to reliably measure density, in particular with a dozen variables.
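For example, one common heuristic from the literature is the k-distance plot: sort every point's distance to its k-th nearest neighbour (k = min_samples) and look for an elbow. A rough sketch, reusing data and covariance from the question; the 95th-percentile stand-in for the elbow is arbitrary, so inspect the plot instead.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

k = 6  # same as min_samples in the question
nn = NearestNeighbors(n_neighbors=k, metric="mahalanobis",
                      metric_params={"V": covariance}).fit(data)
distances, _ = nn.kneighbors(data)
k_distances = np.sort(distances[:, -1])
# plt.plot(k_distances)  # look for the elbow visually, then set eps accordingly

eps = k_distances[int(0.95 * len(k_distances))]  # crude stand-in for the elbow
clusterer = DBSCAN(eps=eps, min_samples=k, metric="mahalanobis",
                   metric_params={"V": covariance}).fit(data)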
I want to find the distribution that best fit some data. This would typically be some sort of measurement data, for instance force or torque.
Ideally I want to run Anderson-Darling with multiple distributions and select the distribution with the highest p-value, similar to the 'Goodness of fit' test in Minitab. I am having trouble finding a Python implementation of Anderson-Darling that calculates the p-value.
I have tried scipy's stats.anderson(), but it only returns the AD statistic and a list of critical values with the corresponding significance levels, not the p-value itself.
I have also looked into statsmodels, but it seems to only support the normal distribution. I need to compare the fit of several distributions (normal, Weibull, lognormal, etc.).
Is there an implementation of the Anderson-Darling test in Python that returns the p-value and supports non-normal distributions?
I would just rank distributions by the goodness-of-fit statistic and not by p-values. We can use the Anderson-Darling, Kolmogorov-Smirnov or a similar statistic purely as a distance measure to rank how well different distributions fit.
Background:
p-values for Anderson-Darling or Kolmogorov-Smirnov depend on whether the parameters are estimated or not. In either case the distribution of the test statistic is not a standard distribution.
In some cases we can tabulate or use a functional approximation to tabulated values. This is the case when parameters are not estimated and if the distribution is a simple location-scale family without shape parameters.
For distributions that have a shape parameter, the distribution of the test statistic that we need for computing the p-values depends on the parameters. That is, we would have to compute different distributions or tabulated p-values for each set of parameters, which is impractical.
The only solution to get p-values in those cases is either by bootstrap or by simulating the test statistic for the specific parameters.
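A sketch of such a parametric bootstrap, with the Anderson-Darling statistic computed from its definition; the distribution choice and the sample x are placeholders.

import numpy as np
from scipy import stats

def ad_statistic(x, dist, params):
    """Anderson-Darling statistic of sample x against dist(*params)."""
    x = np.sort(x)
    n = len(x)
    cdf = dist.cdf(x, *params)
    i = np.arange(1, n + 1)
    return -n - np.mean((2 * i - 1) * (np.log(cdf) + np.log1p(-cdf[::-1])))

def ad_pvalue_bootstrap(x, dist, n_boot=1000, seed=0):
    """Parametric bootstrap: fit, simulate from the fit, refit each
    simulated sample, and compare statistics."""
    rng = np.random.default_rng(seed)
    params = dist.fit(x)
    observed = ad_statistic(x, dist, params)
    boot_stats = [ad_statistic(sim, dist, dist.fit(sim))
                  for sim in (dist.rvs(*params, size=len(x), random_state=rng)
                              for _ in range(n_boot))]
    return np.mean(np.array(boot_stats) >= observed)

# p = ad_pvalue_bootstrap(x, stats.weibull_min)   # x: hypothetical data array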
The technical condition is whether the test statistic is asymptotically pivotal, which means that the asymptotic distribution of the test statistic is independent of the specific parameters.
Using a chi-square test on binned data requires fewer assumptions, and we can compute it even when parameters are estimated. (Strictly speaking, this is only true if the parameters are estimated by MLE on the binned data.)
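A sketch of the ranking approach suggested above, using the Kolmogorov-Smirnov statistic from scipy purely as a distance measure; the candidate list and the data array x are placeholders.

import numpy as np
from scipy import stats

candidates = {
    "normal": stats.norm,
    "lognormal": stats.lognorm,
    "weibull": stats.weibull_min,
    "gamma": stats.gamma,
}

def rank_by_ks(x):
    """Fit each candidate and rank by the KS statistic (smaller = closer fit).
    The statistic is used as a distance measure, not for p-values."""
    results = []
    for name, dist in candidates.items():
        params = dist.fit(x)
        ks_stat, _ = stats.kstest(x, dist.cdf, args=params)
        results.append((name, ks_stat, params))
    return sorted(results, key=lambda r: r[1])

# for name, stat, params in rank_by_ks(x):   # x: hypothetical data array
#     print(f"{name}: KS statistic = {stat:.4f}")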
You can check this page, based on the OpenTURNS library.
Basically, if x is a Python list or a NumPy array,
import openturns as ot
sample = ot.Sample(x)
then call the Anderson-Darling method:
test_result = ot.NormalityTest.AndersonDarlingNormal(sample)
The p-value is obtained by calling test_result.getPValue().
You could use multiple distributions; the dist argument just needs to be a callable. See below how I called gamma.
from statsmodels.stats.diagnostic import anderson_statistic as ad_stat
from scipy import stats

result = ad_stat(df[['Total']], dist=stats.gamma)  # df: your DataFrame with a 'Total' column
You could use any distribution listed in SciPy: https://docs.scipy.org/doc/scipy/reference/stats.html
See the source code for more info: https://www.statsmodels.org/stable/_modules/statsmodels/stats/_adnorm.html
I have images that I am segmenting using a Gaussian mixture model from scikit-learn. Some of the images are labeled, so I have a good bit of prior information that I would like to use. I would like to run semi-supervised training of a mixture model by providing some of the cluster assignments ahead of time.
From the MATLAB documentation, I can see that MATLAB allows initial values to be set. Are there any Python libraries, especially scikit-learn approaches, that would allow this?
The standard GMM does not work in a semi-supervised fashion. The initial values you mention are likely the initial values for the mean vectors and covariance matrices of the Gaussians, which are then updated by the EM algorithm.
A simple hack is to group your labeled data by label, individually estimate mean vectors and covariance matrices for each group, and pass these as the initial values to your MATLAB function; scikit-learn's GaussianMixture accepts the same kind of initialization through its means_init, weights_init and precisions_init parameters (a sketch is shown below). Hopefully this will position your Gaussians at the "correct" locations, and the EM algorithm will then take it from there to adjust these parameters.
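A sketch of that initialization with scikit-learn's GaussianMixture; the variable names X_labeled, y_labeled and X_all are placeholders for your data.

import numpy as np
from sklearn.mixture import GaussianMixture

# X_labeled, y_labeled: labeled feature vectors and their cluster labels
# X_all: all data to segment -- names are placeholders
classes = np.unique(y_labeled)
n_components = len(classes)

means_init = np.array([X_labeled[y_labeled == c].mean(axis=0) for c in classes])
# Needs enough labeled points per class for an invertible covariance matrix
precisions_init = np.array([np.linalg.inv(np.cov(X_labeled[y_labeled == c], rowvar=False))
                            for c in classes])

gmm = GaussianMixture(n_components=n_components,
                      covariance_type="full",
                      means_init=means_init,
                      precisions_init=precisions_init)
labels = gmm.fit_predict(X_all)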
The downside of this hack is that it does not guarantee your true label assignments will be respected: even if a data point was given a particular cluster label, it might be re-assigned to another cluster. Noise in your feature vectors or labels could also cause the initial Gaussians to cover a much larger region than they are supposed to, wreaking havoc on the EM algorithm. And if you do not have enough data points for a particular cluster, the estimated covariance matrix might be singular, breaking the trick altogether.
Unless you must use a GMM to cluster your data (e.g., you know for sure that Gaussians model your data well), you could instead try the semi-supervised methods in scikit-learn. These propagate labels to your other data points based on feature similarity. However, I doubt they can handle large datasets, since they require building a graph Laplacian from pairs of samples, unless scikit-learn has some special implementation trick to handle this.
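A minimal sketch with LabelSpreading, where unlabeled samples are marked with -1; variable names are placeholders, and the knn kernel is chosen because it keeps the similarity graph sparse, which helps with larger datasets.

import numpy as np
from sklearn.semi_supervised import LabelSpreading

# X: feature vectors for all samples; y: known labels, with -1 for every
# unlabeled sample -- both names are placeholders
label_model = LabelSpreading(kernel="knn", n_neighbors=7)
label_model.fit(X, y)
propagated = label_model.transduction_  # labels inferred for all samples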