I am plotting both a distribution of test scores and a fitted curve to these test scores:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

h = sorted(data['Baseline'])  # sorted scores
fit = stats.norm.pdf(h, np.mean(h), np.std(h))  # normal pdf evaluated at the observations
plt.plot(h, fit, '-o')
plt.hist(h, density=True)  # use this to draw a normalised histogram of your data
plt.show()
The plot of the pdf, however, does not look normal (see kink in curve near x=60). See output:
I'm not sure what is going on here...any help appreciated. Is this because the normal line is being drawn between supplied observations? Can provide you with the actual data if needed, there are only 60 observations.
Yes, the normal pdf is being evaluated only at the observations, so the plotted line simply connects those points. You would instead want to evaluate it on a dense, evenly spaced grid, like
h = np.sort(data['Baseline'])  # sorted scores as an array
x = np.linspace(h.min(), h.max(), 151)  # dense, evenly spaced grid
fit = stats.norm.pdf(x, np.mean(h), np.std(h))
plt.plot(x, fit, '-')
plt.hist(h, density=True)
plt.show()
Note, however, that the data does not look normally distributed at all, so you might rather fit a different distribution, or perhaps perform a kernel density estimate.
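For instance, a minimal sketch of a kernel density estimate with scipy.stats.gaussian_kde, assuming the same data['Baseline'] column as above:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

h = np.sort(data['Baseline'])
x = np.linspace(h.min(), h.max(), 151)
kde = stats.gaussian_kde(h)  # bandwidth chosen automatically (Scott's rule)
plt.hist(h, density=True)
plt.plot(x, kde(x), '-')  # smooth density estimate instead of a normal pdf
plt.show()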
I built a GMM model and used this to run a prediction.
from sklearn.mixture import GaussianMixture

bead = df['Ce140Di']
dna = df['DNA_1']
X = np.column_stack((dna, bead))  # create a 2D array from the two columns
#plt.scatter(X[:,0], X[:,1], s=0.5, c='black')
#plt.show()
gmm = GaussianMixture(n_components=4, covariance_type='tied')
gmm.fit(X)
labels = gmm.predict(X)
and then generated a plot as follows...
df['predicted_cluster'] = labels
fig= plt.figure()
colors = {1:'red', 2:'orange', 3:'purple', 0:'grey'}
plt.scatter(df['DNA_1'], df['Ce140Di'], c=df['predicted_cluster'].apply(lambda x: colors[x]), s = 0.5, alpha=0.5)
plt.show()
scatter plot colored by predictions
Whilst I have the predicted label for each row of my df, I don't actually know which cluster it corresponds to without looking at my colors dictionary. Is there a way to do this without having to look at the scatter plot each time?
In other words, I want to know that 0 will always correspond to my grey cluster, or that 1 will always be the red cluster, but this changes each time...
Colors aside, how do I know the position of each cluster? What does a label of 0 mean?
EDIT I believe the answer to my perhaps silly question is to use np.random.seed but I could be wrong...
Hello Hajar,
I think the answer to your question will disappoint you. I assume each Gaussian in your GMM is initialised to some random mean and variance. If you set a random seed then you could be reasonably certain that the resultant clusters will always be the same.
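For what it's worth, here is a minimal sketch of both ideas, assuming the same X as above: GaussianMixture accepts a random_state argument for a reproducible fit, and the fitted gmm.means_ can be used to remap the arbitrary labels onto a deterministic order (here, sorted by the first feature):

import numpy as np
from sklearn.mixture import GaussianMixture

# Reproducible fit: the same seed gives the same component initialisation.
gmm = GaussianMixture(n_components=4, covariance_type='tied', random_state=0)
gmm.fit(X)
labels = gmm.predict(X)

# Optional: relabel so that cluster 0 is always the component with the smallest
# mean along the first feature, 1 the next, and so on.
order = np.argsort(gmm.means_[:, 0])                # component indices sorted by mean
remap = {old: new for new, old in enumerate(order)}
labels = np.array([remap[l] for l in labels])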
With that said, in multi-label scenarios without a random seed there are (to my knowledge) no clustering algorithms that guarantee which label is assigned to each cluster.
Clustering algorithms assign labels arbitrarily. The only guarantee any clustering algorithm makes about a point assigned a certain label is that it is similar to other points with the same label by some metric.
This makes measuring the accuracy of clustering algorithms quite challenging. Hence the existence of metrics like the Adjusted Mutual Information Score and the Adjusted Rand Index.
You could account for this with a sort of semi-supervised approach, in which you force a particular point to start with a "ground-truth" label and hope your algorithm centres a cluster on it, but even then there may be variance.
Good luck, and I hope this helps.
I have a set of 3d coordinates (x,y,z) to which I would like to fit a space curve. Does anyone know of existing routines for this in Python?
From what I have found (https://docs.scipy.org/doc/scipy/reference/interpolate.html), there are existing modules for fitting a curve to a set of 2d coordinates, and others for fitting a surface to a set of 3d coordinates. I want the middle path - fitting a curve to a set of 3d coordinates.
EDIT --
I found an explicit answer to this on another post here, using interpolate.splprep() and interpolate.splev(). Here are my data points:
import numpy as np
data = np.array([[21.735556483642707, 7.9999120559310359, -0.7043281314370935],
[21.009401429607784, 8.0101161320825103, -0.16388503829177037],
[20.199370045383134, 8.0361339131845497, 0.25664085801558179],
[19.318149385194054, 8.0540100864979447, 0.50434139043379278],
[18.405497793567243, 8.0621753888918484, 0.57169888018720161],
[17.952649703401562, 8.8413995204241491, 0.39316793526155014],
[17.539007529982641, 9.6245700151356104, 0.14326173861202204],
[17.100154581079089, 10.416295524018977, 0.011339000091976647],
[16.645143439968102, 11.208477191735446, 0.070252116425261066],
[16.198247656768263, 11.967005154933993, 0.31087815045809558],
[16.661378578010989, 12.717314230004659, 0.54140549139204996],
[17.126106263351478, 13.503461982612732, 0.57743407626794219],
[17.564249250974573, 14.28890107482801, 0.42307198199366186],
[17.968265052275274, 15.031985807202176, 0.10156997950061938]])
Here is my code:
from scipy import interpolate
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
data = data.transpose()
#now we get all the knots and info about the interpolated spline
tck, u= interpolate.splprep(data, k=5)
#here we generate the new interpolated dataset,
#increase the resolution by increasing the spacing, 500 in this example
new = interpolate.splev(np.linspace(0,1,500), tck, der=0)
#now lets plot it!
fig = plt.figure()
ax = Axes3D(fig)
ax.plot(data[0], data[1], data[2], label='original points', lw=2, c='Dodgerblue')
ax.plot(new[0], new[1], new[2], label='fit', lw=2, c='red')
ax.legend()
plt.savefig('junk.png')
plt.show()
This is the image:
You can see that the fit is not good, even though I am already using the maximum allowed spline order (k=5). Is this because the curve is not fully convex? Does anyone know how I can improve the fit?
It depends on what the points represent, but if it's just position data, you could use a Kalman filter, such as this one written in Python. You could then query the Kalman filter at any time to get the "expected point" at that time, so it would work just like a function of time.
If you do plan to use a Kalman filter, set the initial estimate to your first coordinate and set your covariance to a diagonal matrix of huge numbers. This indicates that you are very uncertain about the position of your next point, which will quickly lock the filter onto your coordinates.
You'd want to stay away from spline fitting methods, because splines will always go through your data.
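A minimal, library-free sketch of the constant-position setup described above (not the linked implementation), assuming points is the (N, 3) array of coordinates:

import numpy as np

def kalman_smooth(points, process_var=1e-3, meas_var=1e-2):
    n_dim = points.shape[1]
    x = points[0].astype(float)      # initial estimate = first coordinate
    P = np.eye(n_dim) * 1e6          # huge initial covariance
    Q = np.eye(n_dim) * process_var  # process noise
    R = np.eye(n_dim) * meas_var     # measurement noise
    estimates = []
    for z in points:
        P = P + Q                     # predict (identity motion model)
        K = P @ np.linalg.inv(P + R)  # Kalman gain
        x = x + K @ (z - x)           # update with the new measurement
        P = (np.eye(n_dim) - K) @ P
        estimates.append(x.copy())
    return np.array(estimates)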
You can fit a curve to data of any dimension. The curve fitting / optimization algorithms (say, in scipy.optimize) all treat the observations you want to model as a plain 1-d array, and do not care what the independent variables are. If you flatten your 3d data, each value will correspond to an (x, y, z) tuple. You can just pass that information along as "extra" data to your fitting routine to help you calculate the model curve that will be fitted to your data.
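As a minimal sketch of that idea (not the spline method from the question), assuming points is the (N, 3) array of (x, y, z) samples, one could fit a cubic parametric curve with scipy.optimize.least_squares on the flattened residuals:

import numpy as np
from scipy.optimize import least_squares

def model(params, t):
    coeffs = params.reshape(3, 4)  # one set of cubic coefficients per coordinate
    return np.vstack([np.polyval(c, t) for c in coeffs]).T  # shape (N, 3)

def residuals(params, t, points):
    # Flatten the (N, 3) differences into a plain 1-d array for the optimizer.
    return (model(params, t) - points).ravel()

t = np.linspace(0.0, 1.0, len(points))  # one curve parameter per sample
result = least_squares(residuals, x0=np.zeros(12), args=(t, points))
smooth = model(result.x, np.linspace(0.0, 1.0, 500))  # dense curve for plotting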
I'm trying to draw a curve for regression fitting. The curve is for a higher degree polynomial ( 6 and above ).
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_axes((0.1, 0.2, 0.8, 0.7))
ax1.set_title("Training data (blue) and fitting curve (red)")
ax1.set_xlabel('X-axis')
ax1.set_ylabel('Y-axis')
ax1.plot(x_train, y_train, '.', x_train, np.polyval(best_coef, x_train), '-r')
plt.show()
This is the output of the given code
I want it to be a smooth curve, something like this, with a continuous red line instead of discrete red points.
I think you just need to sort x_train before plotting the fit results:
ax1.plot(x_train, y_train, '.', np.sort(x_train), np.polyval(best_coef, np.sort(x_train)), '-r')
The plot you included suggests that the x_train values (and therefore also the fitted values) are in random order; the plot routine connects consecutive points in the arrays rather than the nearest points, which is why the red line jumps around.
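Alternatively, a sketch of the same idea that evaluates the polynomial on a dense, evenly spaced grid (assuming the same x_train, y_train and best_coef), which gives a smooth red line regardless of the ordering of x_train:

import numpy as np

xs = np.linspace(np.min(x_train), np.max(x_train), 200)  # dense grid over the data range
ax1.plot(x_train, y_train, '.', xs, np.polyval(best_coef, xs), '-r')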
I have plotted some experimental data in Python and need to find a cubic fit to the data. The reason I need to do this is because the cubic fit will be used to remove background (in this case resistance in a diode) and you will be left with the evident features. Here is the code I am currently using to make the cubic fit in the first place, where Vnew and yone represent arrays of the experimental data.
from numpy import array
from scipy.optimize import curve_fit
from matplotlib.pyplot import plot, legend

answer1 = raw_input('Cubic Plot attempt?\n ')
if answer1 in ['y', 'Y', 'Yes']:
    def cubic(x, A):
        return A * x**3
    cubic_guess = array([40])
    popt, pcov = curve_fit(cubic, Vnew, yone, cubic_guess)
    plot(Vnew, cubic(Vnew, *popt), 'r-', label='Cubic Fit: curve_fit')
    #ylim(-0.05, 0.05)
    legend(loc='best')
    print 'Cubic plotted'
else:
    print 'No Cubic Removal done'
I have knowledge of curve smoothing but only in theory. I do not know how to implement it. I would really appreciate any assistance.
Here is the graph generated so far:
To make the fitted curve "wider", you're looking for extrapolation. Although in this case, you could just make Vnew cover a larger interval, in which case you'd put this before your plot command:
Vnew = numpy.linspace(-1,1, 256) # min and max are merely an example, based on your graph
plot(Vnew,cubic(Vnew,*popt),'r-',label='Cubic Fit: curve_fit')
"Blanking out" the feature you see, can be done with numpy's masked arrays but also just by removing those elements you don't want from both your original Vnew (which I'll call xone) and yone:
mask = (xone > 0.1) & (xone < 0.35) # values between these voltages (?) need to be removed
xone = xone[numpy.logical_not(mask)]
yone = yone[numpy.logical_not(mask)]
Then redo the curve fitting:
popt,_ = curve_fit(cubic, xone, yone, cubic_guess)
This will have fitted only to the data that was actually there (which aren't that many points in your dataset, from the looks of it, so beware!).
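After the refit, the background removal described in the question could then look something like this (a sketch, assuming the full Vnew and yone arrays are still available):

background = cubic(Vnew, *popt)  # fitted cubic background
features = yone - background     # what remains once the background is removed
plot(Vnew, features, '.', label='Background-subtracted data')
legend(loc='best')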
I want to interpolate some data, and to plot the result on a log scale (pyplot.loglog). The problem is that the resulting interpolation looks very strange and shows discontinuities when plotted on a log scale. What is the best way to interpolate log scaled data?
import numpy
import scipy.interpolate
from matplotlib import pyplot

pyplot.loglog(x, y, '+')
s = scipy.interpolate.InterpolatedUnivariateSpline(x, y)
xs = numpy.logspace(numpy.log10(numpy.min(x)), numpy.log10(numpy.max(x)))
pyplot.loglog(xs, s(xs))  # This looks very strange because of the log scale!
Actually, I succeeded in doing it by interpolating the log of the data, but I was wondering if there was a simpler way of achieving the same result?
pyplot.loglog(x, y, '+')
s = scipy.interpolate.InterpolatedUnivariateSpline(numpy.log10(x), numpy.log10(y))
xs = numpy.logspace(numpy.log10(numpy.min(x)), numpy.log10(numpy.max(x)))
pyplot.loglog(xs, numpy.power(10, s(numpy.log10(xs))))
It looks like taking the logarithm of the data first and then fitting is the normal way to do this. See Fitting a Power Law Distribution.
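For reference, a sketch of the same idea as a straight-line fit in log-log space (assuming the same x, y and xs as above), which corresponds to fitting a power law y = a * x**b:

import numpy

b, log_a = numpy.polyfit(numpy.log10(x), numpy.log10(y), 1)  # slope and intercept in log space
a = 10 ** log_a
pyplot.loglog(x, y, '+')
pyplot.loglog(xs, a * xs ** b)  # fitted power law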