I'm trying to compare the data (black) and the model (color). [Fig. 1]
There is another example [Fig. 2]. The data set and the model are different for Fig. 1 and Fig. 2.
In both cases the model appears to overlap the data; however, the overlap/matching is better in Fig. 2. I would like to quantify the correlation between the data and the model in both cases in order to distinguish the "goodness of fit" of the two figures. Which (statistical) method should I use to estimate this correlation quantitatively?
You could start by calculating the center of gravity of each dataset using numpy.mean and comparing how close they are to one another.
The next step is to check whether each center lies inside the uncertainty ellipse (http://www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/) of the other dataset.
Finally, I would recommend hypothesis testing, such as Student's t-test or the F-test. scipy.stats provides methods for these kinds of tests; see the documentation.
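A minimal sketch of the first and last steps, assuming both the data and the model are 2-D point clouds stored as NumPy arrays (the arrays and their values here are made up):

```python
import numpy as np
from scipy import stats

# Hypothetical 2-D point clouds: rows are points, columns are (x, y).
data = np.random.normal(loc=0.0, scale=1.0, size=(200, 2))
model = np.random.normal(loc=0.1, scale=1.1, size=(200, 2))

# Step 1: centers of gravity and their separation.
data_centre = data.mean(axis=0)
model_centre = model.mean(axis=0)
print("centre separation:", np.linalg.norm(data_centre - model_centre))

# Step 3: hypothesis tests on each coordinate.
# Welch's t-test compares the means, Levene's test compares the variances.
for axis, name in enumerate(("x", "y")):
    t_stat, t_p = stats.ttest_ind(data[:, axis], model[:, axis], equal_var=False)
    l_stat, l_p = stats.levene(data[:, axis], model[:, axis])
    print(f"{name}: t-test p={t_p:.3f}, Levene p={l_p:.3f}")
```

Higher p-values here mean the test finds no significant difference between the two samples, which is one way to say the model matches the data better.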
I have a time series in which I am trying to detect anomalies. For those anomalies I want a range within which the data points should lie in order not to be flagged as anomalous. I am using the ML.NET anomaly-detection algorithm and have that part working, but how do I get the range?
If I can somehow get the range for the points in the time series, I can plot it and show that the points outside this range are anomalies.
I have tried to calculate the range using a prediction-interval calculation, but that does not work for all the data points in the time series.
For example, assume I have 100 points: I take 100/4, i.e. 25, as the sliding window to calculate the prediction interval for the next point, i.e. the 26th point. The problem then is: how do I calculate the prediction interval for the first 25 points?
A method operating on a fixed-length sliding window generally needs that entire window to be filled before it can produce an output. In that case you must pad the beginning of the input sequence if you want predictions (and thus anomaly scores) for the first data points. It can be hard to make that padded data realistic, however, which can lead to poor predictions.
A nifty technique is to compute anomaly scores with two different models, one going in the forward direction and the other in the reverse direction, so that you get scores everywhere. However, you then have to decide how to handle the areas where you have two sets of predictions, e.g. whether to use the minimum, maximum, or average anomaly score.
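As an illustration (this is not the ML.NET API, just a plain pandas sketch with an invented window size and threshold), a rolling mean ± k·std prediction band can be computed forward and backward and then combined so that every point gets a range:

```python
import numpy as np
import pandas as pd

def rolling_band(series, window=25, k=3.0):
    """Prediction band from a trailing window: mean +/- k * std of the past points."""
    mean = series.rolling(window).mean().shift(1)   # shift so only past points are used
    std = series.rolling(window).std().shift(1)
    return mean - k * std, mean + k * std

values = pd.Series(np.random.normal(size=100))      # toy time series

# Forward pass covers points after the first window; backward pass covers the start.
lo_f, hi_f = rolling_band(values)
lo_b, hi_b = rolling_band(values[::-1])
lo_b, hi_b = lo_b[::-1], hi_b[::-1]

# Combine the two passes, here by taking the wider (more conservative) band.
lower = pd.concat([lo_f, lo_b], axis=1).min(axis=1)
upper = pd.concat([hi_f, hi_b], axis=1).max(axis=1)

anomalies = (values < lower) | (values > upper)
```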
There are also models that handle variable-length inputs well, such as sequence-to-sequence models built with recurrent neural networks.
Using Python, I want to compare a simulated light curve with the real (measured) light curve. The measured data contain gaps and outliers, and the time steps are not constant; the model, however, has constant time steps.
As a first step I would like to use a statistical method to quantify how similar the two light curves are. Which method is best suited for this?
In a second step I would like to fit the model to my measurement data. However, the model data are not calculated in Python but by independent software. Basically, the model depends on four parameters, each limited to a certain range, which I currently feed into the software manually (automating this is planned).
What is the best method to create a suitable fit?
A "Brute-Force-Fit" is currently an option that comes to my mind.
This link https://imgur.com/a/zZ5xoqB shows three plots: the simulated light curve, the actual measurement, and both together. The simulation is not good yet, but by playing with the parameters one can get an acceptable result, meaning the phase and period are the same, the magnitude is of the same order, and even the specular flashes occur at the same period.
If I understand this correctly, you're asking a more foundational question that could be better answered in https://datascience.stackexchange.com/, rather than something specific to Python.
That said, speaking as a data-science layperson, this may be a problem suited to gradient descent with a mean-squared-error cost function. You initialize the parameters of the curve (possibly randomly), then calculate the squared error at your known points.
Then you make tiny changes to each parameter in turn, and calculate how the cost function is affected. Then you change all the parameters (by a tiny amount) in the direction that decreases the cost function. Repeat this until the parameters stop changing.
(Note that this might trap you in a local minimum and not work.)
More information: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
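A toy sketch of that procedure with a made-up two-parameter model (the real model here is computed by external software, so this only shows the mechanics of numerical gradient descent):

```python
import numpy as np

def model(t, params):
    """Hypothetical 2-parameter model: amplitude and phase of a sine with known period."""
    amplitude, phase = params
    return amplitude * np.sin(2 * np.pi * t / 3.0 + phase)

def cost(params, t, observed):
    return np.mean((model(t, params) - observed) ** 2)   # mean squared error

# Toy "measurements" generated with true parameters (2.0, 0.5).
t = np.linspace(0, 10, 200)
observed = model(t, (2.0, 0.5)) + np.random.normal(0, 0.1, t.size)

params = np.array([1.0, 0.0])            # initial guess
lr, eps = 0.01, 1e-6
for _ in range(5000):
    grad = np.zeros_like(params)
    for i in range(params.size):         # numerical derivative, one parameter at a time
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (cost(params + step, t, observed) - cost(params - step, t, observed)) / (2 * eps)
    params -= lr * grad                  # step downhill
print(params)                            # should end up near (2.0, 0.5), barring a local minimum
```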
Edit: I overlooked this part:
"The simulation is not good yet, but by playing with the parameters one can get an acceptable result, meaning the phase and period are the same, the magnitude is of the same order, and even the specular flashes occur at the same period."
Is the simulated curve just a sum of sine waves, and are the parameters just the phase/period/amplitude of each? In that case, what you're looking for is the Fourier transform of your signal, which is easy to compute with numpy: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html
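If so, a quick check with numpy on an evenly sampled toy signal (the FFT assumes constant time steps, so it applies directly to the model curve) might look like:

```python
import numpy as np

dt = 0.01                                   # constant time step, as in the model
t = np.arange(0, 10, dt)
signal = 1.5 * np.sin(2 * np.pi * 0.7 * t) + 0.5 * np.sin(2 * np.pi * 2.3 * t)

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(signal.size, d=dt)
amplitudes = 2 * np.abs(spectrum) / signal.size

# The strongest peaks give the dominant frequencies (hence periods) and amplitudes.
peaks = np.argsort(amplitudes)[-2:]
for i in sorted(peaks):
    print(f"frequency {freqs[i]:.2f} Hz, amplitude {amplitudes[i]:.2f}")
```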
I am trying to separate a data set that has 2 clusters that do not overlap in any way, plus a single data point that is far away from these two clusters.
When I use kmeans() to get the 2 clusters, it splits one of the "valid" clusters in half and treats the single data point as a separate cluster.
Is there a way to specify a minimum number of points per cluster? I am using MATLAB.
There are several solutions:
Easy: try with 3 clusters;
Easy: remove the single data point (which you can detect as an outlier with any outlier-detection technique); see the sketch after this list.
To be tried: use a k-medoids approach instead of k-means. This sometimes helps get rid of outliers.
More complicated but more reliable: perform spectral clustering. This gets you past the main issue of k-means, which is its blunt use of the Euclidean distance.
More explanation of the inadequate behaviour of k-means can be found on the Cross Validated site (see here, for instance).
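For illustration only, here is what the first two options might look like in Python with scikit-learn (made-up data; the same ideas carry over to MATLAB's kmeans):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy clusters plus one far-away point.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=(0, 0), scale=0.5, size=(50, 2))
cluster_b = rng.normal(loc=(5, 5), scale=0.5, size=(50, 2))
outlier = np.array([[20.0, 20.0]])
X = np.vstack([cluster_a, cluster_b, outlier])

# Option 1: ask for 3 clusters and treat the tiny one as the outlier.
labels3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Option 2: drop points far from the overall centroid, then cluster with k=2.
dist = np.linalg.norm(X - X.mean(axis=0), axis=1)
keep = dist < dist.mean() + 3 * dist.std()
labels2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[keep])
```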
I need a Python solution to force a polynomial to end at a specific point. I have read the solutions offered in "How to do a polynomial fit with fixed points" for a similar question, but I have been unable to get any of those methods working on my data set, as they define locations for the polynomial to pass through rather than an end point.
I therefore require a solution to force a polynomial curve to end at a specific location.
To put this in context, the example I need this for is shown in the image below. I need a line of best fit for the data: the green points are the raw data and the pink points are the mean of the green points for each x value. The best fit should be a 3rd-order polynomial until the data flattens into a horizontal line. The black line is my current attempt at a best fit using np.polyfit(); I have restricted the polynomial to plot only up to the location where the linear best-fit line would start, but as you can see the tail of the polynomial is far too low, hence I want to force it to end at / pass through a specific point.
I am open to all options for getting a mathematically sensible best fit, as I have been banging my head against this problem for too long now.
Using a logistic sigmoid instead of a polynomial:
Formula and parameters, generated from some of your sample data points (taken from the picture):
where S(x) = 1 / (1 + exp(-x)) is the logistic sigmoid function.
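As a rough sketch of how such a fit could be done with scipy (the parameterisation and starting values below are my own guesses, not the exact formula from the picture):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, L, k, x0, b):
    """Generalised logistic curve: plateau height L, steepness k, midpoint x0, offset b."""
    return L / (1.0 + np.exp(-k * (x - x0))) + b

# x, y would be your pink mean points; toy values are used here.
x = np.linspace(0, 100, 50)
y = 30 / (1 + np.exp(-0.1 * (x - 40))) + 5 + np.random.normal(0, 0.5, x.size)

p0 = [y.max() - y.min(), 0.1, np.median(x), y.min()]   # rough initial guess
params, _ = curve_fit(sigmoid, x, y, p0=p0, maxfev=10000)
fitted = sigmoid(x, *params)
```

The fitted curve flattens at large x by construction, which matches the horizontal tail you want without having to pin the polynomial to a specific end point.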
A different approach, since you seem to want to identify outliers in the horizontal dimension:
Stratify your data by power, say into 10-kW intervals, so that each interval contains "enough" points to apply some robust estimator of their dispersion. In each stratum, discard the upper extreme observations until the robust estimate settles to a stable value. Then use the maximum remaining value in each stratum as the criterion against which to gauge whether any given device should be regarded as "inefficient".
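A rough sketch of that stratification, assuming a pandas DataFrame with hypothetical columns power (in kW) and value, and using the median absolute deviation as the robust dispersion estimate (the stability tolerance is made up):

```python
import numpy as np
import pandas as pd
from scipy.stats import median_abs_deviation

# Hypothetical data: 'power' in kW, 'value' is the quantity being screened.
df = pd.DataFrame({
    "power": np.random.uniform(0, 100, 1000),
    "value": np.random.normal(50, 5, 1000),
})

df["stratum"] = (df["power"] // 10).astype(int)        # 10-kW bins

thresholds = {}
for stratum, group in df.groupby("stratum"):
    vals = np.sort(group["value"].to_numpy())
    # Discard upper extremes until the robust dispersion estimate stabilises.
    while len(vals) > 10:
        mad_all = median_abs_deviation(vals)
        mad_trim = median_abs_deviation(vals[:-1])
        if mad_all == 0 or abs(mad_all - mad_trim) / mad_all < 0.01:
            break
        vals = vals[:-1]
    thresholds[stratum] = vals.max()                    # stratum-wise cut-off
```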
I need to crossmatch a list of astronomical coordinates with different catalogues, and I want to choose a maximum radius for the crossmatch in order to avoid mismatches between my list and the catalogues.
To do this, I compute the separation to the best catalogue match for each object in my list. Each entry in my initial list is supposed to be the position of a known object, but it may happen that the object is not detected in the catalogue, and my coordinates may suffer from small offsets.
The way I compute the maximum radius is by fitting the Gaussian kernel density of the separations with a Gaussian and using the centre + 3 sigma value. The method works nicely in most cases, but when a small subsample of my list has an offset, I get two Gaussians instead. In those cases, I will specify the maximum radius in a different way.
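For reference, a minimal sketch of that procedure (with made-up separation values, in arcseconds):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.optimize import curve_fit

def gauss(x, amplitude, mu, sigma):
    return amplitude * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

separations = np.abs(np.random.normal(0.3, 0.1, 500))   # toy separations, arcsec

grid = np.linspace(0, separations.max(), 200)
density = gaussian_kde(separations)(grid)                # kernel density estimate

# Fit a single Gaussian to the KDE and take centre + 3 sigma as the max radius.
params, _ = curve_fit(gauss, grid, density,
                      p0=[density.max(), grid[np.argmax(density)], 0.1])
amplitude, mu, sigma = params
max_radius = mu + 3 * abs(sigma)
```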
My problem is that when this happens, curve_fit usually cannot perform the fit with a single Gaussian. For a scientific publication, I will need to justify the "no fit" from curve_fit and explain in which cases the "different way" is used. Could someone give me a hand with what this means in mathematical terms?
There are varying lengths to which you can go in justifying this or that fitting ansatz, and how far you go depends strongly on the details of your specific case (e.g. why do you expect a Gaussian to work in the first place? to what depth do you need or want to explain why a certain fitting procedure fails, and what exactly counts as a failure? etc.).
If the question is really about curve_fit and its failure to converge, then show us some code and input data that demonstrate the problem.
If the question is about how to evaluate the goodness-of-fit, you're best off going back to the library and picking a good book on statistics.
If all you are looking for is a way of justifying why, in a certain case, a Gaussian is not a good fitting ansatz, one option is to calculate the moments: for a Gaussian distribution, the 1st, 2nd, 3rd and higher moments are related to each other in a very precise way. If you can demonstrate that for your underlying data the relations between the moments are very different, it is reasonable to conclude that these data cannot be fit by a Gaussian.
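For example, the 3rd and 4th standardised moments (skewness and excess kurtosis) should both be close to zero for a Gaussian, which is easy to check with scipy (toy data used here):

```python
import numpy as np
from scipy import stats

samples = np.random.normal(0.3, 0.1, 500)        # replace with your separations

skew = stats.skew(samples)
kurt = stats.kurtosis(samples)                   # Fisher definition: 0 for a Gaussian
print(f"skewness = {skew:.3f}, excess kurtosis = {kurt:.3f}")

# Large deviations from 0, relative to the approximate standard errors
# sqrt(6/n) and sqrt(24/n), argue against a single-Gaussian description.
n = samples.size
print("skewness s.e. ~", np.sqrt(6 / n), " kurtosis s.e. ~", np.sqrt(24 / n))

# scipy also provides a combined test based on these two moments:
stat, p_value = stats.normaltest(samples)
print(f"D'Agostino-Pearson normality test p-value = {p_value:.3f}")
```

A very small p-value from that test, or moments far from their Gaussian values, would be a quantitative justification for abandoning the single-Gaussian fit in those cases.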