I have to calibrate a distance-measuring instrument that gives capacitance as its output. I am able to use numpy polyfit to find a relation and apply it to get distance, but I need to include the limit of detection, 0.0008 m, as it is the resolution of the instrument.
My data is:
cal_distance = [.1 , .4 , 1, 1.5, 2, 3]
cal_capacitance = [1971, 2336, 3083, 3720, 4335, 5604]
raw_data = [3044,3040,3039,3036,3033]
I need my distance values to be like .1008, .4008 that represents the limits of detection of the instrument.
I have used the following code:
import numpy as np
coeffs = np.polyfit(cal_capacitance, cal_distance, 1)
new_distance = []
for i in raw_data:
    d = i*coeffs[0] + coeffs[1]
    new_distance.append(d)  # collect the calibrated distance for each raw reading
I have a csv file and actually used a pandas dataframe with date time index to store the raw data, but for simplicity I have given a list here.
I need to include the limits of detection in the calibration process to get it right.
Limit of detection is the accuracy of your measurement (the smallest 'step' you can resolve)
polyfit gives you a 'model' of the best fit function f of the relation
distance = f(capacitance)
You use 1 as the degree of the polynomial so you're basically fitting a line.
So, first off you need to look into the accuracy of the fit: this is returned when you pass the keyword argument full=True.
(see the docs: http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html for more details)
You will get the residual of the fit (the sum of the squared residuals).
Is it actually smaller than the LOD? Otherwise your limiting factor is the fitting accuracy. In your particular case it looks like it is 0.00017021, so indeed below the 0.0008 LOD.
Second, why 'add' the LOD to the reading? Your reading is the reading. The LOD is then the +/- range within which the true distance could lie. Adding it to the end result does not seem to make sense here.
You should instead report the final value as 'new distance' +/- LOD.
Is your raw data all measurements of the same distance? If so, you can see that the standard deviation of this measurement using the fit is 0.0029680362423331122, ( numpy.std(new_distance) ) and range is 0.0087759439302268483, which is 10x over the LOD, so here your limiting factor really seems to be the measuring conditions.
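As a rough sketch of the checks described above (not a definitive procedure), the following fits the calibration line with full=True, converts the raw readings, and summarizes their spread. The names lod, rms_error, spread and value_range are introduced here for illustration, and treating the five raw readings as repeats of one distance is an assumption.

```python
import numpy as np

cal_distance = [0.1, 0.4, 1, 1.5, 2, 3]
cal_capacitance = [1971, 2336, 3083, 3720, 4335, 5604]
raw_data = [3044, 3040, 3039, 3036, 3033]
lod = 0.0008  # instrument resolution in metres, as stated in the question

# full=True additionally returns the sum of squared residuals of the fit
coeffs, residuals, rank, singular_values, rcond = np.polyfit(
    cal_capacitance, cal_distance, 1, full=True
)
rms_error = np.sqrt(residuals[0] / len(cal_distance))  # typical fit error in metres

# Apply the fitted line to the raw capacitance readings
new_distance = np.polyval(coeffs, raw_data)
spread = np.std(new_distance)       # standard deviation of the converted readings
value_range = np.ptp(new_distance)  # max minus min

# Report each value as "distance +/- LOD" instead of adding the LOD to it
for d in new_distance:
    print(f"{d:.4f} +/- {lod} m")
print("fit RMS error:", rms_error, "std:", spread, "range:", value_range)
```

Comparing rms_error, spread and value_range with lod shows which of the three (fit, measuring conditions, instrument resolution) dominates the uncertainty.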
Not to beat a dead horse, but LOD and precision are two completely different things. LOD is typically defined as three times the standard deviation of the noise of your instrument, which would be equivalent to the minimum capacitance (or distance, which is related to capacitance here) your instrument can detect, i.e. anything less than that is equivalent to zero (more or less). Your precision, on the other hand, is the minimum change in capacitance that can be detected by your instrument, which may or may not be less than the LOD. Such terms (in addition to accuracy) are common sources of confusion. While you may know what you are talking about when you say LOD (and everyone else may be able to understand that you really mean precision), it would be beneficial to use the proper terminology. Just a thought...
I am currently gathering X gyro data of a device. I need to find the differences between them in terms of their y values in the graphs such as this one.
I have written an algorithm that finds the min and max y values, but when the data has some minor fluctuations like this one, my algorithm returns faulty answers. I have written it in Python and it is as follows:
import numpy as np
x, y = np.loadtxt("gyroDataX.txt", unpack=True, delimiter=",")
max_value = max(y)
min_value = min(y)  # the borders concern the y values, so take the minimum of y
borders = max_value - min_value
I now need to write an algorithm that will:
Determine the max and min y value and draw their borders.
If it sees minor fluctuations it will ignore them.
Would writing such an algorithm be possible, and if so how could I go about writing one? Is there any libraries or any reading material you would recommend? Thanks for your help.
1. Generally, maths like this should be done in plain code, with little or no help from external APIs, so it's easier to debug the algorithm and its processes.
2. Now for a possible answer:
Since you do not want any outliers (those buggy and irritating minor fluctuations), you need to calculate the standard deviation of your data.
What is the standard deviation, you might ask?
It represents how far your data typically lies from the mean (symbol: µ), i.e. the average, of your data set.
If you do not know what it is, here is a quick tutorial on the standard deviation and its partner, the variance.
First, the MEAN:
It should not be that hard to calculate the mean of your x and y arrays/lists. Just loop (a for loop would be optimal) through each list, add up all the values, and divide the sum by the length of the list. There you have the mean.
Second, the VARIANCE (σ squared):
If you followed the website above: to calculate the variance, loop through the x and y lists again, subtract each value from its respective mean to get the difference, square this difference, add all the squared differences up, and divide by the length of the respective list. That gives you the variance.
For the Standard Deviation (Symbol: σ), just take the square root of the variance.
Now the standard deviation can be used to find the real max and min (leaving out those buggy outliers).
Use this graph as an approximate reference to find where most of your values may be:
Since your data is mostly uniform, you should get a pretty accurate answer.
Do test the different situations: σ + µ or 2σ + µ; to find the optimum max and min.
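A minimal sketch of the idea above, assuming the borders are taken as µ ± kσ of the y values. The function name fluctuation_tolerant_bounds, the parameter k, and the sample readings are hypothetical and only serve as illustration. NumPy is used for brevity, but the same sums can be written as plain loops.

```python
import numpy as np

def fluctuation_tolerant_bounds(y, k=2.0):
    """Return (lower, upper) = (mu - k*sigma, mu + k*sigma) for the samples y."""
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    sigma = y.std()  # divides by N, matching the variance recipe described above
    return mu - k * sigma, mu + k * sigma

# Hypothetical gyro readings: a flat signal with one spike
y = np.array([0.1, -0.1, 0.0, 0.2, -0.2, 0.1, -0.1, 0.0, 5.0, 0.0])

lower, upper = fluctuation_tolerant_bounds(y, k=2.0)
inliers = y[(y >= lower) & (y <= upper)]  # samples that fall inside the borders

print("borders:", lower, upper)
print("min/max of the remaining samples:", inliers.min(), inliers.max())
```

Trying k=1 and k=2 on real data, as suggested above, shows how tightly the borders hug the bulk of the signal.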
Sorry, only y now:
Sorry for the horrible representation and drawing. This is what it should look like graphically. Also, do experiment yourself with the number of standard deviations from the mean (as above: σ + µ or 2σ + µ) to find the max and min that suit you.
I have 4 different distributions which I've fitted to a sample of observations. Now I want to compare my results and find the best solution. I know there are a lot of different methods to do that, but I'd like to use a quantile-quantile (q-q) plot.
The formulas for my 4 distributions are:
where K0 is the modified Bessel function of the second kind and zeroth order, and Γ is the gamma function.
My sample style looks roughly like this: (0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7 ...), so I have multiple identical values and also gaps in between them.
I've read the instructions on this site and tried to implement them in python. So, like in the link:
1) I sorted my data from the smallest to the largest value.
2) I computed "n" evenly spaced points on the interval (0,1), where "n" is my sample size.
3) And this is the point I can't manage.
As far as I understand, I should now use the values I calculated beforehand (those evenly spaced values), put them in the inverse functions of my above distributions and thus compute the theoretical quantiles of my distributions.
For reference, here are the inverse functions (partly calculated with wolframalpha, and as far it was possible):
where W is the Lambert W-function and everything in brackets afterwards is the argument.
The problem is that, apparently, no inverse function exists for the first distribution. The next one would probably produce complex values (a negative number under the root, because b = 0.55 according to the fit), and the last two contain a Lambert W-function (which I'm unsure how to implement in Python).
So my question is, is there a way to calculate the q-q plots without the analytical expressions of the inverse distribution functions?
I'd appreciate any help you could give me very much!
A simpler and more conventional way to go about this is to compute the log likelihood for each model and choose that one that has the greatest log likelihood. You don't need the cdf or quantile function for that, only the density function, which you have already.
The log likelihood is just the sum of log p(x|model) where p(x|model) is the probability density of datum x under a given model. Here "model" = model with parameters selected by maximizing the log likelihood over the possible values of the parameters.
You can be more careful about this by integrating the log likelihood over the parameter space, taking into account also any prior probability assigned to each model; that would be a Bayesian approach.
It sounds like you are essentially looking to choose a model by minimizing the Kolmogorov-Smirnov (KS) statistic, which, despite its heavy name, is pretty simple: it is the largest difference between the model's cumulative distribution function and the empirical one. That's defensible, but I think comparing log likelihoods is more conventional, and also simpler since you need only the pdf.
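The log-likelihood comparison described above could be sketched roughly as follows. The synthetic sample and the three candidate distributions from scipy.stats are stand-ins chosen for illustration, not the four distributions from the question, and fixing the location at 0 is an assumption.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for the observed sample (strictly positive values)
rng = np.random.default_rng(0)
sample = rng.exponential(scale=0.5, size=200)

# Hypothetical candidate models; replace these with the fitted densities
candidates = {
    "expon": stats.expon,
    "gamma": stats.gamma,
    "lognorm": stats.lognorm,
}

log_likelihoods = {}
for name, dist in candidates.items():
    params = dist.fit(sample, floc=0)  # maximum-likelihood fit, location fixed at 0
    # Sum of log p(x | model) over all data points
    log_likelihoods[name] = np.sum(dist.logpdf(sample, *params))

best = max(log_likelihoods, key=log_likelihoods.get)
print(log_likelihoods)
print("largest log likelihood:", best)
```

Note that a model with more free parameters can always fit at least as well, so a plain comparison like this favours the more flexible candidates.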
It happens that there is an easier way. It's taken me a day or two to dig around until I was pointed toward the right method in scipy.stats. I was looking for the wrong sort of name!
First, build a subclass of rv_continuous to represent one of your distributions. We know the pdf for your distributions, so that's what we define. In this case there's just one parameter. If more are needed just add them to the def statement and use them in the return statement as required.
>>> import numpy as np
>>> from scipy import stats
>>> param = 3/2
>>> class NoName(stats.rv_continuous):
...     def _pdf(self, x, param):
...         return param*np.exp(-param*x)
Now create an instance of this object, declare the lower end of its support (ie, the lowest value that the r.v. can assume), and what the parameters are called.
>>> noname = NoName(a=0, shapes='param')
I don't have an actual sample of values to play with. I'll create a pseudo-random sample.
>>> sample = noname.rvs(size=100, param=param)
Sort it to obtain the empirical quantiles (the order statistics of the sample).
>>> empirical_quantiles = sorted(sample)
The sample has 100 elements, so generate 100 probabilities at which to evaluate the inverse cdf, or quantile function, as discussed in the paper you referenced.
>>> theoretical_points = [(_-0.5)/len(sample) for _ in range(1, 1+len(sample))]
Get the quantile function values at these points.
>>> theoretical_quantiles = [noname.ppf(_, param=param) for _ in theoretical_points]
Plot it all.
>>> from matplotlib import pyplot as plt
>>> plt.plot([0,3.5], [0, 3.5], 'b-')
[<matplotlib.lines.Line2D object at 0x000000000921B400>]
>>> plt.scatter(empirical_quantiles, theoretical_quantiles)
<matplotlib.collections.PathCollection object at 0x000000000921BD30>
>>> plt.show()
Here's the Q-Q plot that results.
Darn it ... Sorry, I was fixated on a slick solution to somehow bypass the missing inverse CDF and calculate the quantiles directly (and avoid any numerical approaches). But it can also be done by simple brute force.
First, you have to define the quantiles for your distributions yourself (for instance, on a grid ten times finer than the original/empirical quantiles). Then you need to calculate the corresponding CDF values. Then you have to compare these values one by one with the ones which were calculated in step 2 in the question. The quantiles whose CDF values show the smallest deviations are the ones you were looking for.
The precision of this solution is limited by the resolution of the quantiles you defined yourself.
But maybe I'm wrong and there is a more elegant way to solve this problem, then I would be happy to hear it!
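The brute-force procedure described above might look roughly like this. The exponential density example_pdf, the grid limits, and the helper name numeric_quantiles are assumptions made for illustration. The CDF is obtained here by integrating the pdf numerically with the trapezoidal rule, which is one possible choice when no closed-form CDF is at hand.

```python
import numpy as np

def example_pdf(x, rate=1.5):
    # Hypothetical density used only to demonstrate the method
    return rate * np.exp(-rate * x)

def numeric_quantiles(pdf, probs, lower, upper, n_grid=20001):
    """Approximate quantiles by matching probs against a tabulated CDF."""
    grid = np.linspace(lower, upper, n_grid)  # self-defined candidate quantiles
    density = pdf(grid)
    # Cumulative trapezoidal integration of the pdf gives the CDF on the grid
    steps = 0.5 * (density[1:] + density[:-1]) * np.diff(grid)
    cdf = np.concatenate(([0.0], np.cumsum(steps)))
    cdf /= cdf[-1]  # renormalise because the support is truncated at `upper`
    # For each target probability pick the grid point with the closest CDF value
    probs = np.asarray(probs, dtype=float)
    nearest = np.abs(cdf[None, :] - probs[:, None]).argmin(axis=1)
    return grid[nearest]

n = 100
probs = (np.arange(1, n + 1) - 0.5) / n  # evenly spaced points in (0, 1)
theoretical_quantiles = numeric_quantiles(example_pdf, probs, lower=0.0, upper=20.0)
print(theoretical_quantiles[:5])
```

As noted above, the precision is limited by the grid spacing, here (upper - lower) / (n_grid - 1).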
I'm reading a book on Data Science for Python and the author applies 'sigma-clipping operation' to remove outliers due to typos. However the process isn't explained at all.
What is sigma clipping? Is it only applicable for certain data (eg. in the book it's used towards birth rates in US)?
As per the text:
quartiles = np.percentile(births['births'], [25, 50, 75]) #so we find the 25th, 50th, and 75th percentiles
mu = quartiles[1] #we set mu = 50th percentile
sig = 0.74 * (quartiles[2] - quartiles[0]) #???
This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
Why 0.74? Is there a proof for this?
This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
That's it, really...
The code tries to estimate sigma using the interquartile range to make it robust against outliers. 0.74 is a correction factor. Here is how to calculate it:
import scipy as sp
import scipy.stats  # makes sp.stats available
p1 = sp.stats.norm.ppf(0.25) # first quartile of standard normal distribution
p2 = sp.stats.norm.ppf(0.75) # third quartile
print(p2 - p1) # 1.3489795003921634
sig = 1 # standard deviation of the standard normal distribution
factor = sig / (p2 - p1)
print(factor) # 0.74130110925280102
In the standard normal distribution sig==1 and the interquartile range is 1.35. So 0.74 is the correction factor to turn the interquartile range into sigma. Of course, this is only true for the normal distribution.
Suppose you have a set of data. Compute its median m and its standard deviation sigma. Keep only the data that falls in the range (m-a*sigma,m+a*sigma) for some value of a, and discard everything else. This is one iteration of sigma clipping. Continue to iterate a predetermined number of times, and/or stop when the relative reduction in the value of sigma is small.
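The iteration described above can be sketched as follows. The function name sigma_clip and the defaults a=3.0, max_iter=5 and tol=1e-3 are choices made for this illustration, not values taken from the book.

```python
import numpy as np

def sigma_clip(data, a=3.0, max_iter=5, tol=1e-3):
    """Iteratively keep only values within a*sigma of the median."""
    data = np.asarray(data, dtype=float)
    for _ in range(max_iter):
        m = np.median(data)
        sigma = np.std(data)
        if sigma == 0:
            break  # all remaining values are identical
        kept = data[np.abs(data - m) < a * sigma]
        new_sigma = np.std(kept)
        data = kept
        # Stop when sigma no longer shrinks noticeably
        if (sigma - new_sigma) / sigma < tol:
            break
    return data

values = [1, 2, 3, 4, 1, 2, 3, 4, 5, 3, 4, 5, 3, 500, 1000]
clipped = sigma_clip(values)
print(clipped)
```

With this input the first pass removes only 1000, because the inflated sigma still covers 500. The second pass removes 500, and the third pass changes nothing, so the loop stops.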
Sigma clipping is geared toward removing outliers, to allow for a more robust (i.e. resistant to outliers) estimation of, say, the mean of the distribution. So it's applicable to data where you expect to find outliers.
As for the 0.74, it comes from the interquartile range of the Gaussian distribution, as per the text.
The answers here are accurate and reasonable, but don't quite get to the heart of your question:
What is sigma clipping? Is it only applicable for certain data?
If we want to use mean (mu) and standard deviation (sigma) to figure out a threshold for ejecting extreme values in situations where we have a reason to suspect that those extreme values are mistakes (and not just very high/low values), we don't want to calculate mu/sigma using the dataset which includes these mistakes.
Sample problem: you need to compute a threshold for a temperature sensor to indicate when the temperature is "High" - but sometimes the sensor gives readings that are impossible, like "surface of the sun" high.
Imagine a series that looks like this:
import numpy as np
import scipy.stats as st
thisSeries = np.array([1,2,3,4,1,2,3,4,5,3,4,5,3, 500, 1000])
Those last two values look like obvious mistakes - but if we use a typical stats function like a Normal PPF, it's going to implicitly assume that those outliers belong in the distribution, and perform its calculation accordingly:
st.norm.ppf(.975, thisSeries.mean(), thisSeries.std())
So using a two-sided 5% outlier threshold (meaning we will reject the lower and upper 2.5%), it's telling me that 500 is not an outlier. Even if I use a one-sided threshold of .95 (reject the upper 5%), it will give me 546 as the outlier limit, so again, 500 is regarded as non-outlier.
Sigma-clipping works by focusing on the inter-quartile range and using median instead of mean, so the thresholds won't be calculated under the influence of the extreme values.
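import pandas as pd
thisDF = pd.DataFrame(thisSeries, columns=["value"])
quartiles = np.percentile(thisSeries, [25, 50, 75])
mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])
factor = 5  # number of robust sigmas allowed on either side of the median
queryString = '(value < @mu - {} * @sig) | (value > @mu + {} * @sig)'.format(factor, factor)
outliers = thisDF.query(queryString)  # rows flagged as outliers
print(mu + factor * sig)  # upper threshold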
At factor=5, both outliers are correctly isolated, and the threshold is at a reasonable 10.4 - reasonable, given that the 'clean' part of the series is [1,2,3,4,1,2,3,4,5,3,4,5,3]. ('factor' in this context is a scalar applied to the thresholds)
To answer the question, then: sigma clipping is a method of identifying outliers which is immune from the deforming effects of the outliers themselves, and though it can be used in many contexts, it excels in situations where you suspect that the extreme values are not merely high/low values that should be considered part of the dataset, but rather that they are errors.
Here's an illustration of the difference between extreme values that are part of a distribution, and extreme values that are possibly errors, or just so extreme as to deform analysis of the rest of the data.
The data above was generated synthetically, but you can see that the highest values in this set are not deforming the statistics.
Now here's a set generated the same way, but this time with some artificial outliers injected (above 40):
If I sigma-clip this, I can get back to the original histogram and statistics, and apply them usefully to the dataset.
But where sigma-clipping really shines is in real world scenarios, in which faulty data is common. Here's an example that uses real data - historical observations of my heart-rate monitor. Let's look at the histogram without sigma-clipping:
I'm a pretty chill dude, but I know for a fact that my heart rate is never zero. Sigma-clipping handles this easily, and we can now look at the real distribution of heart-rate observations:
Now, you may have some domain knowledge that would enable you to manually assert outlier thresholds or filters. This is one final nuance to why we might use sigma-clipping - in situations where data is being handled entirely by automation, or we have no domain knowledge relating to the measurement or how it's taken, then we don't have any informed basis for filter or threshold statements.
It's easy to say that a heart rate of 0 is not a valid measurement - but what about 10? What about 200? And what if heart-rate is one of thousands of different measurements we're taking. In such cases, maintaining sets of manually defined thresholds and filters would be overly cumbersome.
I think there is a small typo in the sentence "This final line is a robust estimate of the sample mean". From the derivation above, I think the final line is a robust estimate of one sigma (the standard deviation) of births, provided the data follow a normal distribution.
I have a set of points . Their geometry (SRID: 4326) is stored in a Database.
I have been given some code that aims to cluster these points with DBSCAN. The parameters have been set as follows: eps=1000, min_points=1.
I obtain clusters that are less distant than 1000 meters. I believed that two points less distant than 1000 meters would belong to the same cluster. Is epsilon really in meters?
The code is the following:
if self.debug:
    print('Nbr of Points: %d' % len(X))
# print(X.shape)
# print(dist_matrix.shape)
D = distance.squareform(distance.pdist(X, 'euclidean'))
# print(dist_matrix)
# S = 1 - (D / np.max(D))
db = DBSCAN(eps=eps, min_samples=min_samples).fit(D)
self.core_samples = db.core_sample_indices_
self.labels = db.labels_
the aim is not to find another way to run it but really to understand the value of eps. What it represents in term of distance. Min_sample is set to one because I accept to have clusters with a size of 1 sample indeed.
This depends on your implementation.
Your distance function could return anything; including meters, millimeters, yards, km, miles, degrees... but you did not share what function you use for computing distance!
If I'm not mistaken, SRID: 4326 does not imply anything on distance computations.
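The "haversine" metric used by sklearn works in radians, not meters: it expects latitude and longitude in radians and returns an angular distance, which you have to multiply by the Earth's radius to get meters.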
Either way, min_points=1 is nonsensical. The query point is included, so every point itself is a cluster. With min_points <= 2, the result of DBSCAN will be a single-linkage clustering. To get a density based clustering, you need to choose a higher value to get real density.
You may want to use ELKI's DBSCAN. According to their Java sources, their distance function uses meters, but also their R*-tree index allows accelerated range queries with this distance, which will yield a substantial speed-up (O(n log n) instead of O(n^2)).
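To make the unit question concrete, here is a small sketch that builds a pairwise distance matrix in meters from latitude/longitude given in degrees. The helper name haversine_matrix_m, the spherical-Earth radius, and the sample coordinates are assumptions for illustration. The Euclidean pdist call in the question, applied to SRID 4326 coordinates, would instead give distances in degrees.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; a spherical approximation

def haversine_matrix_m(lat_deg, lon_deg):
    """Pairwise great-circle distances in meters for points given in degrees."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against tiny rounding excursions outside [0, 1]
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Hypothetical points: two on the equator one degree apart, one close to the first
lats = [0.0, 0.0, 0.005]
lons = [0.0, 1.0, 0.0]
D = haversine_matrix_m(lats, lons)
print(D)
```

A matrix like D is in meters, so eps=1000 would then mean 1000 meters. When handing a distance matrix to scikit-learn's DBSCAN, it has to be declared with metric='precomputed'; otherwise the rows of the matrix are treated as feature vectors.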
I'm trying to write my own Python code to compute t-statistics and p-values for one and two tailed independent t tests. I can use the normal approximation, but for the moment I am trying to just use the t-distribution. I've been unsuccessful in matching the results of SciPy's stats library on my test data. I could use a fresh pair of eyes to see if I'm just making a dumb mistake somewhere.
Note, this is cross-posted from Cross-Validated because it's been up for a while over there with no responses, so I thought it can't hurt to also get some software developer opinions. I'm trying to understand if there's an error in the algorithm I'm using, which should reproduce SciPy's result. This is a simple algorithm, so it's puzzling why I can't locate the mistake.
My code:
import numpy as np
import scipy.stats as st
def compute_t_stat(pop1,pop2):
    num1 = pop1.shape[0]
    num2 = pop2.shape[0]
    # The formula for t-stat when population variances differ.
    t_stat = (np.mean(pop1) - np.mean(pop2))/np.sqrt( np.var(pop1)/num1 + np.var(pop2)/num2 )
    # ADDED: The Welch-Satterthwaite degrees of freedom.
    df = ((np.var(pop1)/num1 + np.var(pop2)/num2)**(2.0))/( (np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1) )
    # Am I computing this wrong?
    # It should just come from the CDF like this, right?
    # The extra parameter is the degrees of freedom.
    one_tailed_p_value = 1.0 - st.t.cdf(t_stat,df)
    two_tailed_p_value = 1.0 - ( st.t.cdf(np.abs(t_stat),df) - st.t.cdf(-np.abs(t_stat),df) )
    # Computing with SciPy's built-ins
    # My results don't match theirs.
    t_ind, p_ind = st.ttest_ind(pop1, pop2)
    return t_stat, one_tailed_p_value, two_tailed_p_value, t_ind, p_ind
After reading a bit more on Welch's t-test, I saw that I should be using the Welch-Satterthwaite formula to calculate the degrees of freedom. I updated the code above to reflect this.
With the new degrees of freedom, I get a closer result. My two-sided p-value is off by about 0.008 from the SciPy version's... but this is still much too big an error so I must still be doing something incorrect (or SciPy distribution functions are very bad, but it's hard to believe they are only accurate to 2 decimal places).
Second update:
While continuing to try things, I thought maybe SciPy's version automatically computes the Normal approximation to the t-distribution when the degrees of freedom are high enough (roughly > 30). So I re-ran my code using the Normal distribution instead, and the computed results are actually further away from SciPy's than when I use the t-distribution.
Bonus question :)
(More statistical theory related; feel free to ignore)
Also, the t-statistic is negative. I was just wondering what this means for the one-sided t-test. Does this typically mean that I should be looking in the negative axis direction for the test? In my test data, population 1 is a control group who did not receive a certain employment training program. Population 2 did receive it, and the measured data are wage differences before/after treatment.
So I have some reason to think that the mean for population 2 will be larger. But from a statistical theory point of view, it doesn't seem right to concoct a test this way. How could I have known to check (for the one-sided test) in the negative direction without relying on subjective knowledge about the data? Or is this just one of those frequentist things that, while not philosophically rigorous, needs to be done in practice?
By using the SciPy built-in function source(), I could see a printout of the source code for the function ttest_ind(). Based on the source code, the SciPy built-in is performing the t-test assuming that the variances of the two samples are equal. It is not using the Welch-Satterthwaite degrees of freedom. SciPy assumes equal variances but does not state this assumption.
I just want to point out that, crucially, this is why you should not just trust library functions. In my case, I actually do need the t-test for populations of unequal variances, and the degrees of freedom adjustment might matter for some of the smaller data sets I will run this on.
As I mentioned in some comments, the discrepancy between my code and SciPy's is about 0.008 for sample sizes between 30 and 400, and then slowly goes to zero for larger sample sizes. This is an effect of the extra (1/n1 + 1/n2) term in the equal-variances t-statistic denominator. Accuracy-wise, this is pretty important, especially for small sample sizes. It definitely confirms to me that I need to write my own function. (Possibly there are other, better Python libraries, but this at least should be known. Frankly, it's surprising this isn't anywhere up front and center in the SciPy documentation for ttest_ind()).
You are not calculating the sample variance, but instead you are using population variances. Sample variance divides by n-1, instead of n. np.var has an optional argument called ddof for reasons similar to this.
This should give you your expected result:
import numpy as np
import scipy.stats as st
def compute_t_stat(pop1,pop2):
    num1 = pop1.shape[0]
    num2 = pop2.shape[0]
    var1 = np.var(pop1, ddof=1)
    var2 = np.var(pop2, ddof=1)
    # The formula for t-stat when population variances differ.
    t_stat = (np.mean(pop1) - np.mean(pop2)) / np.sqrt(var1/num1 + var2/num2)
    # ADDED: The Welch-Satterthwaite degrees of freedom.
    df = ((var1/num1 + var2/num2)**(2.0))/((var1/num1)**(2.0)/(num1-1) + (var2/num2)**(2.0)/(num2-1))
    # Am I computing this wrong?
    # It should just come from the CDF like this, right?
    # The extra parameter is the degrees of freedom.
    one_tailed_p_value = 1.0 - st.t.cdf(t_stat,df)
    two_tailed_p_value = 1.0 - ( st.t.cdf(np.abs(t_stat),df) - st.t.cdf(-np.abs(t_stat),df) )
    # Computing with SciPy's built-ins
    # My results don't match theirs.
    t_ind, p_ind = st.ttest_ind(pop1, pop2)
    return t_stat, one_tailed_p_value, two_tailed_p_value, t_ind, p_ind
PS: SciPy is open source and mostly implemented in Python. You could have checked the source code for ttest_ind and found your mistake yourself.
For the bonus side: You don't decide on the side of the one-tail test by looking at your t-value. You decide it beforehand with your hypothesis. If your null hypothesis is that the means are equal and your alternative hypothesis is that the second mean is larger, then your tail should be on the left (negative) side. Because sufficiently small (negative) values of your t-value would indicate that the alternative hypothesis is more likely to be true instead of the null hypothesis.
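As a cross-check of the sample-variance fix above, here is a short sketch that compares a hand-written Welch test with SciPy's own unequal-variance mode, which ttest_ind exposes through the equal_var=False argument. The function name welch_t_test and the synthetic control/treated samples are made up for this illustration.

```python
import numpy as np
import scipy.stats as st

def welch_t_test(sample1, sample2):
    """Welch's t statistic, Welch-Satterthwaite df and two-tailed p-value."""
    n1, n2 = len(sample1), len(sample2)
    v1 = np.var(sample1, ddof=1) / n1  # sample variance (divide by n-1) over n
    v2 = np.var(sample2, ddof=1) / n2
    t_stat = (np.mean(sample1) - np.mean(sample2)) / np.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    p_two_tailed = 2.0 * st.t.sf(np.abs(t_stat), df)  # both tails of the t distribution
    return t_stat, df, p_two_tailed

# Hypothetical data: two groups with different sizes and variances
rng = np.random.default_rng(42)
control = rng.normal(0.0, 1.0, size=40)
treated = rng.normal(0.5, 2.0, size=60)

t_manual, df, p_manual = welch_t_test(control, treated)
t_scipy, p_scipy = st.ttest_ind(control, treated, equal_var=False)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

With the default equal_var=True, ttest_ind performs the pooled-variance test instead, which is why the two calculations differ when that argument is left out.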
Looks like you forgot the **2 on the numerator of your df (the Welch-Satterthwaite degrees of freedom).
df = (np.var(pop1)/num1 + np.var(pop2)/num2)/( (np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1) )
should be:
df = (np.var(pop1)/num1 + np.var(pop2)/num2)**2/( (np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1) )