Using sklearn with complex values - python

I am trying to perform k-means in Python using scikit-learn. The problem is that my data are complex values, and scikit-learn doesn't accept them.
Is there any way to use sklearn with complex values?

It depends on what those complex numbers represent to you.
If you want to cluster them in a meaningful sense, you have to define a metric.
You can perform the clustering on the real and imaginary coordinates, or you can use the absolute value as the clustering feature. Alternatively, you can convert the complex numbers to polar form and cluster them on the modulus and angular argument.
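A minimal sketch of those three options, assuming a hypothetical 1-D array z of complex eigenvalues and scikit-learn's KMeans:

import numpy as np
from sklearn.cluster import KMeans

# z is a hypothetical 1-D array of complex eigenvalues
z = np.array([1 + 2j, 0.9 + 2.1j, -3 + 0.5j, -2.8 + 0.4j])

# option 1: real and imaginary parts as two features
X_cart = np.column_stack([z.real, z.imag])
# option 2: absolute value only
X_abs = np.abs(z).reshape(-1, 1)
# option 3: polar form, i.e. modulus and angular argument
X_polar = np.column_stack([np.abs(z), np.angle(z)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_cart)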

Thank you for your answers. Actually, the complex numbers don't represent anything in particular to me; I am performing k-means on the eigenvalues of the Laplacian matrix of my dataset. I tried using the absolute value. The problem is that the larger the number of clusters, the greater the inertia, so I get an increasing inertia with the elbow method. Is that normal?

Related

Find the appropriate polynomial fit for data in Python

Is there a function or library in Python to automatically compute the best polynomial fit for a set of data points? I am not really interested in the ML use case of generalizing to a set of new data; I am just focusing on the data I have. I realize the higher the degree, the better the fit, but I want something that penalizes complexity or looks at where the error curve elbows. By elbowing, I mean something like this (although usually it is not so drastic or obvious):
One idea I had was to use Numpy's polyfit (https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.polyfit.html) to compute polynomial regression for a range of orders/degrees. Polyfit requires the user to specify the degree of polynomial, which poses a challenge because I don't have any assumptions or preconceived notions. The higher the degree of fit, the lower the error will be, but eventually it plateaus like the image above. Therefore, if I want to automatically find the degree where the error curve elbows: if my error is E and d is my degree, I want to maximize the second difference (E[d+1] - E[d]) - (E[d] - E[d-1]).
Is this even a valid approach? Are there other tools and approaches in well-established Python libraries like Numpy or Scipy that can help with finding the appropriate polynomial fit (without me having to specify the order/degree)? I would appreciate any thoughts or suggestions! Thanks!
To select the "right" fit and prevent over-fitting, you can use the Akaike Information Criterion or the Bayesian Information Criterion. Note that your fitting procedure can be non-Bayesian and you can still use these to compare fits. Here is a quick comparison between the two methods.
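As an illustration, here is a rough sketch of scoring numpy.polyfit fits of increasing degree with AIC under a Gaussian-error assumption; the data x, y are made up for the example:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = 0.5 * x**3 - x + rng.normal(scale=2.0, size=x.size)   # made-up noisy data

def aic_for_degree(x, y, d):
    # AIC for a least-squares polynomial fit of degree d, assuming Gaussian errors
    coeffs = np.polyfit(x, y, d)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    n, k = len(y), d + 1                      # k = number of fitted parameters
    return n * np.log(rss / n) + 2 * k

degrees = list(range(1, 10))
aics = [aic_for_degree(x, y, d) for d in degrees]
best_degree = degrees[int(np.argmin(aics))]   # smallest AIC wins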

Problems with computing the entropy of random variables via SciPy (stats)

Recently I've been trying to figure out how to calculate the entropy of a random variable X using sp.stats.entropy() from the stats package of SciPy, with this random variable X being the returns I obtain from the stock of a specific company ("Company 1") from 1997 to 2012 (this is for a financial data/machine learning assignment). However, the arguments require the probability values pk, and so far I'm struggling even to compute the empirical probabilities, seeing as I only have the observations of the random variable. I've tried different ways of normalising the data in order to obtain an array of probabilities, but my data contain negative values too, which means that when I try asset1/np.sum(asset1), where asset1 is the row array of the returns of the stock of "Company 1", I obtain a new array which adds up to 1 but contains some negative values, and as we all know, negative probabilities do not exist. Therefore, is there any way of computing the empirical probabilities of my observations (ideally with the option of choosing specific bins, or for a range of values) in Python?
Furthermore, I've been trying for countless hours to find a Python package solely dedicated to calculating random variable entropies, joint entropies, mutual information, etc. as an alternative to SciPy's entropy option (simply to compare), but most seem to be outdated (I currently have Python 3.5). Does anyone know of a good package that is compatible with my current version of Python? I know R seems to have a very compact one.
Any kind of help would be highly appreciated. Thank you very much in advance!
EDIT: stock returns are considered to be RANDOM VARIABLES, as opposed to the stock prices which are processes. Therefore, the entropy can definitely be applied in this context.
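One simple way to get non-negative empirical probabilities from the raw returns, sketched here with made-up data, is to bin them with numpy.histogram and feed the normalised counts to scipy.stats.entropy:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
asset1 = rng.normal(0, 0.02, size=4000)   # stand-in for the Company 1 returns

# bin the returns; the probabilities come from the counts, so negative returns are fine
counts, bin_edges = np.histogram(asset1, bins=50)
pk = counts / counts.sum()

H = stats.entropy(pk)   # Shannon entropy in nats; pass base=2 for bits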
For continuous distributions, you are better off using the Kozachenko-Leonenko k-nearest neighbour estimator for entropy (K & L 1987) and the corresponding Kraskov, ..., Grassberger (2004) estimator for mutual information. These circumvent the intermediate step of calculating the probability density function and estimate the entropy directly from the distances of data points to their k-nearest neighbours.
The basic idea of the Kozachenko-Leonenko estimator is to look at (some function of) the average distance between neighbouring data points. The intuition is that if that distance is large, the dispersion in your data is large and hence the entropy is large. In practice, instead of taking the nearest neighbour distance, one tends to take the k-nearest neighbour distance, which tends to make the estimate more robust.
I have implementations for both on my github:
https://github.com/paulbrodersen/entropy_estimators
The code has only been tested using python 2.7, but I would be surprised if it doesn't run on 3.x.
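For reference, here is a rough, self-contained sketch of the Kozachenko-Leonenko estimator (not the github implementation above) using a k-d tree and the volume of the d-dimensional unit ball:

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(x, k=3):
    # Kozachenko-Leonenko k-NN entropy estimate in nats (a sketch, not production code)
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    n, d = x.shape
    tree = cKDTree(x)
    eps = tree.query(x, k=k + 1)[0][:, -1]    # distance to the k-th neighbour (self excluded)
    log_unit_ball = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_unit_ball + d * np.mean(np.log(2 * eps))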

Vector quantization for categorical data

Software for vector quantization usually works only on numerical data. One example of this is Python's scipy.cluster.vq.vq (here), which performs vector quantization. The numerical data requirement also shows up for most clustering software.
Many have pointed out that you can always convert a categorical variable to a set of binary numeric variables. But this becomes awkward when working with big data where an individual categorical variable may have hundreds or thousands of categories.
The obvious alternative is to change the distance function. With mixed data types, the distance from an observation to a "center" or "codebook entry" could be expressed as a two-part sum involving (a) the usual Euclidean calculation for the numeric variables and (b) the sum of inequality indicators for categorical variables, as proposed here on page 125.
Is there any open-source software implementation of vector quantization with such a generalized distance function?
For machine learning and clustering algorithms you may also find scikit-learn useful. To achieve what you want, you can have a look at their implementation of DBSCAN.
In their documentation, you can find:
sklearn.cluster.dbscan(X, eps=0.5, min_samples=5, metric='minkowski', algorithm='auto', leaf_size=30, p=2, random_state=None)
Here X can be either your already computed distance matrix (if you also pass metric='precomputed') or the standard samples-by-features matrix, while metric= can be a string (the identifier of one of the already implemented distance functions) or a callable Python function that computes distances in a pair-wise fashion.
If you can't find the metric you want, you can always program it as a python function:
def mydist(a, b):
    return np.sum(np.abs(a - b))  # the metric you want comes here; it must return a single number
And call dbscan with metric=mydist. Alternatively, you can calculate your distance matrix beforehand and pass it to the clustering algorithm.
There are some other clustering algorithms in the same library, have a look at them here.
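To make this concrete for the mixed-type, two-part distance from the question, here is a sketch that precomputes such a distance matrix and hands it to DBSCAN; the data and the weight of one per category mismatch are assumptions for the example:

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X_num = rng.random((100, 2))                # numeric part (made up)
X_cat = rng.integers(0, 5, size=(100, 2))   # categorical part, integer-coded (made up)

# Euclidean distance on the numeric columns plus the count of category mismatches
D = cdist(X_num, X_num) + cdist(X_cat, X_cat, metric="hamming") * X_cat.shape[1]

labels = DBSCAN(eps=0.5, min_samples=5, metric="precomputed").fit_predict(D)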
You cannot "quantize" categorial data.
Recall definitions of quantization (Wiktionary):
To limit the number of possible values of a quantity, or states of a system, by applying the rules of quantum mechanics
To approximate a continuously varying signal by one whose amplitude can only have a set of discrete values
In other words, quantization means converting a continuous variable into a discrete variable. Vector quantization does the same, for multiple variables at the same time.
However, categorical variables already are discrete.
What you seem to be looking for is a prototype-based clustering algorithm for categorical data (maybe STING and COOLCAT? I don't know if they will produce prototypes); but this isn't "vector quantization" anymore.
I believe that very often, frequent itemset mining is actually the best approach to find prototypes/archetypes of categorical data.
As for clustering algorithms that allow other distance functions - there are plenty. ELKI has a lot of such algorithms, and also a tutorial on implementing a custom distance. But this is Java, not Python. I'm pretty sure at least some of the clustering algorithms in scipy allow custom distances, too.
Now Python's scipy.cluster.vq.vq is really simple code. You do not need a library for that at all. The main job of this function is wrapping a C implementation which runs much faster than Python code... if you look at the py_vq version (which is used when the C version cannot be used), it is really simple code... essentially, for every object obs[i] it calls this function:
code[i] = argmin(np.sum((obs[i] - code_book) ** 2, 1))
Now you obviously can't use Euclidean distance with a categorical codebook; but translating this line to whatever similarity you want is not hard.
The harder part usually is constructing the codebook, not using it.
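For illustration, a small sketch of such a translation: assigning each integer-coded categorical observation to the codebook row with the fewest mismatches (a Hamming-style analogue of vq; the function name and data layout are made up):

import numpy as np

def categorical_vq(obs, code_book):
    # assign each row of obs to the codebook row with the fewest category mismatches
    codes = np.empty(len(obs), dtype=int)
    dists = np.empty(len(obs))
    for i, row in enumerate(obs):
        mismatches = np.sum(row != code_book, axis=1)
        codes[i] = np.argmin(mismatches)
        dists[i] = mismatches[codes[i]]
    return codes, dists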

alternative similarity measure in DBSCAN?

I am testing my image set with the DBSCAN algorithm from the scikit-learn Python module. Are there alternatives to the following way of computing the similarities?
from scipy.spatial import distance
import numpy as np

# Compute similarities from the data matrix X
D = distance.squareform(distance.pdist(X))
S = 1 - (D / np.max(D))
A weighted measure or something like that I could try? Any examples?
There exists a generalization of DBSCAN, known as "Generalized DBSCAN".
Actually, for DBSCAN you do not even need a distance, which is why it does not make sense to compute a similarity matrix in the first place.
All you need is a predicate "getNeighbors", that computes objects you consider as neighbors.
See: in DBSCAN, the distance is not really used, except to test whether an object is a neighbor or not. So all you need is this boolean decision.
You can try the following approach: initialize the matrix with all 1s.
For any two objects that you consider similar for your application (we can't help you much with that without knowing your application and data), fill the corresponding cells with 0.
Then run DBSCAN with epsilon = 0.5, and obviously DBSCAN will consider all the 0s as neighbors.
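A minimal sketch of that trick, with made-up indices standing in for the pairs you consider similar:

import numpy as np
from sklearn.cluster import DBSCAN

n = 50
D = np.ones((n, n))             # start with "not similar" everywhere
np.fill_diagonal(D, 0)          # every object is a neighbor of itself
similar_pairs = [(0, 1), (1, 2), (3, 4)]   # hypothetical pairs for illustration
for i, j in similar_pairs:
    D[i, j] = D[j, i] = 0

# with eps=0.5, only the 0 entries count as neighbors
labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(D)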
You can use whatever similarity matrix you like. It just needs to be based on a valid distance (symmetric, positive semi-definite).
I believe the DBSCAN estimator wants distances, not similarities. But again, when it comes to strings it will require a similarity matrix, which can even be a single line of code checking equality between two strings. So it is up to you how you use the similarity matrix to distinguish between neighboring and non-neighboring objects.

Curve Fitting with Known Integrals Python

I have some data that are the integrals of an unknown curve within bins. For your interest, the data is ocean wave energy and the bins are for directions, e.g. 0-15 degrees. If possible, I would like to fit a curve on to the data that conserves the integrals within the bins. I've tried sketching it on a notepad with a pencil and it seems like it could be possible. Does anyone know of any curve-fitting tool in Python to do this, for example in the scipy interpolation sub-package?
Thanks in advance
Edit:
Thanks for the help. If I do it, it looks like I will try the method that is recommended in section 4 of this paper: http://journals.ametsoc.org/doi/abs/10.1175/1520-0485%281996%29026%3C0136%3ATIOFFI%3E2.0.CO%3B2. In theory, it basically uses matrices to make some 'fake' data from the known integrals between each band. When plotted, this data then produces an interpolated line graph that preserves the integrals.
It's a little outside my bailiwick, but I can suggest having a look at SciKits to see if there's anything there that might be useful. Other packages to browse would be pandas and StatsModels. Good luck!
If you have a curve f(x) which is an approximation to the integral of another curve g(x), i.e. f = int(g, x), then the two are related by the fundamental theorem of calculus: your original function is the derivative of the first curve, g = df/dx. As such, you can use numpy.diff or any of the higher-order methods to approximate df/dx and obtain an estimate of your original curve.
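A tiny sketch of the finite-difference version, with made-up cumulative values on an evenly spaced grid:

import numpy as np

x = np.linspace(0, 90, 7)                                # bin edges in degrees (made up)
f = np.array([0.0, 1.2, 4.6, 9.7, 13.7, 15.9, 16.9])     # cumulative integral at the edges (made up)

g_est = np.diff(f) / np.diff(x)     # finite-difference estimate of g = df/dx
x_mid = (x[:-1] + x[1:]) / 2        # the estimates live at the bin midpoints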
One possibility: calculate the cumulative sum of the bin volumes (np.cumsum), fit an interpolating spline to it, and then take the derivative to get the curve.
scipy splines have methods to calculate the derivatives.
The only limitation, in case it is relevant for your data, is that the spline through the cumulative sum might not be monotonic, so the derivative might be negative over some intervals.
I guess that the literature on smoothing a histogram looks at similar constraints on the volume of the integral/bin, but I don't have any references ready.
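A sketch of that cumulative-sum approach with scipy's interpolating splines; the bin edges and integrals are invented for the example, and by construction the derivative integrates back to the given bin values:

import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

bin_edges = np.arange(0, 105, 15)                          # 0-15, 15-30, ... degree bins (made up)
bin_integrals = np.array([1.2, 3.4, 5.1, 4.0, 2.2, 1.0])   # wave energy per bin (made up)

# the cumulative integral is known exactly at each bin edge
cum = np.concatenate([[0.0], np.cumsum(bin_integrals)])

F = InterpolatedUnivariateSpline(bin_edges, cum, k=3)   # spline through the cumulative sum
g = F.derivative()                                      # the fitted curve itself

x = np.linspace(0, 90, 200)
curve = g(x)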
1/ fit2histogram
Your question is about fitting a histogram. I just came across the documentation of a Python package for Multi-Variate Pattern Analysis, PyMVPA, which proposes a function for histogram fitting. An example is here: PyMVPA.
However, I guess the set of available distributions is limited to well-known distributions.
2/ integral computation
As already mentioned, the other solution is to approximate the underlying curve from the integral values and fit a model to the resulting data. Either you know an explicit expression for the derivative, or you use numerical differentiation: finite differences, or an analytical method.
