sklearn.datasets.make_s_curve output meaning - python

sklearn.datasets has a function called make_s_curve that returns a dataset representing a 3D "S" shape.
The function returns the dataset, which is of shape [num_samples, 3]. It also returns a second output of shape [num_samples], and the documentation explains that it is
the univariate position of the sample according to the main dimension
of the points in the manifold.
I don't understand what this means. Is this some particular ordering of the points in the dataset?
Really appreciate any help.

the univariate position
This means that the position is a single value per sample, and that value will not change.
according to the main dimension of the points
The ordering of these values is based on the main dimension of the S shape, i.e. the direction along the curve.
in the manifold
A manifold is a topological space; here it is the S-shaped surface the samples lie on.
The idea behind this is that you can compare plots, and the points always keep the same order with respect to that main dimension.
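For instance, here is a minimal sketch (the parameter values are just illustrative) that colors the returned points by the second output t; the color changes smoothly along the S, which shows that t orders the samples along the curve.
import matplotlib.pyplot as plt
from sklearn.datasets import make_s_curve
# X has shape (n_samples, 3); t has shape (n_samples,)
X, t = make_s_curve(n_samples=1000, noise=0.05, random_state=0)
# Color the points by t: the color varies smoothly along the S,
# i.e. t orders the samples along the main dimension of the curve.
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=t, cmap="viridis")
plt.show()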

Related

Python Gaussian Process Regression with sklearn

I'm attempting to do a regression to fit a function to some data points I have; simply put, these are (x, y) pairs where x = date and y = a data value. Seems simple enough.
I'm following along with a how-to, and it comes to the part where you split your data into training/testing sets. That much I understand, but the input for model.fit is a 2D array plus labels.
I think I'm being incredibly dense, but this is what I have for that:
model.fit(input, date_time_training)
My input is an array like [[5, 3], [7, 5], etc.]. My "labels" are dates, because that's how I'd want to label my data, but that's not right; they need to be numbers. There are two things they could be, though: my data points, which are y on my graph, or my x-axis values, which are dates. I converted my dates into numbers (0, 1, 2, 3, etc.) corresponding to each date.
Is that also what my labels would be?
Also, my input is just [[date_converted_to_int, score], etc.], which, looking at the documentation, seemingly should be of shape [n_samples, n_features]. I'm pretty confused, and obviously not super experienced with regression either (otherwise I'm guessing this would be clearer).
You are trying to predict (the actual term is forecast in this case) your y over time.
So it is more suitable to use a time series model in this case, because by definition this is a time series use case.
(Time series: you try to understand the evolution of the values of an attribute over time.)
Try some models like:
AR
ARIMA
statsmodels would be a nice place to visit for documentation.
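As a rough sketch (the data below is made up and the ARIMA order is just a starting point), fitting and forecasting with statsmodels could look like this:
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# Hypothetical data: 30 daily scores with a trend plus noise.
dates = pd.date_range("2023-01-01", periods=30, freq="D")
y = pd.Series(np.arange(30) + np.random.RandomState(0).randn(30), index=dates)
# Fit a simple ARIMA(1, 1, 1) model; the order here is just a guess.
result = ARIMA(y, order=(1, 1, 1)).fit()
# Forecast the next 7 days.
print(result.forecast(steps=7))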

PCA in Sklearn: how to return dimensions which explain the most variation, in order? [duplicate]

I need to use PCA to identify the dimensions with the highest variance in a certain set of data. I'm using scikit-learn's PCA to do it, but from the output of the PCA method I can't identify which dimensions of my data have the highest variance. Keep in mind that I don't want to eliminate those dimensions, only identify them.
My data is organized as a matrix with 150 rows of data, each one with 4 dimensions. I'm doing as follows:
pca = sklearn.decomposition.PCA()
pca.fit(data_matrix)
When I print pca.explained_variance_ratio_, it outputs an array of variance ratios ordered from highest to lowest, but it doesn't tell me which dimension from the data they correspond to (I've tried changing the order of columns on my matrix, and the resulting variance ratio array was the same).
Printing pca.components_ gives me a 4x4 matrix (I left the original number of components as the argument to PCA) with some values whose meaning I can't understand. According to scikit-learn's documentation, they should be the components with the maximum variance (the eigenvectors, perhaps?), but there's no sign of which dimensions those values refer to.
Transforming the data doesn't help either, because the dimensions are changed in a way that I can't really tell which ones they were originally.
Is there any way I can get this information with scikit-learn's PCA? Thanks
The values returned in pca.explained_variance_ratio_ are the variance ratios of the principal components. You can use them to find how many dimensions (components) your data could be reduced to by PCA, for example by using a threshold (e.g. counting how many ratios are greater than 0.5). After that, you can transform the data with PCA using the number of components that exceed that threshold. The data reduced to these dimensions is different from the data expressed in the original dimensions.
You can check the code at this link:
http://scikit-learn.org/dev/tutorial/statistical_inference/unsupervised_learning.html#principal-component-analysis-pca
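As a rough sketch of that threshold idea (the data below is made up so that one component clearly dominates the variance):
import numpy as np
from sklearn.decomposition import PCA
# Hypothetical data: 150 samples, 4 dimensions, with the first dimension
# scaled up so one principal component clearly dominates.
rng = np.random.RandomState(0)
data_matrix = rng.randn(150, 4) * np.array([10.0, 2.0, 1.0, 0.5])
pca = PCA()
pca.fit(data_matrix)
# Variance ratio of each principal component, sorted from highest to lowest.
print(pca.explained_variance_ratio_)
# Count how many components pass a chosen threshold, then reduce to that many.
threshold = 0.5
n_keep = max(1, int(np.sum(pca.explained_variance_ratio_ > threshold)))
reduced = PCA(n_components=n_keep).fit_transform(data_matrix)
print(reduced.shape)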

Hidden Markov Model python

I have a time series of the position of a particle over time and I want to estimate the model parameters of two HMMs using this data (one for the x axis, the other for the y axis). I'm using the hmmlearn library; however, it is not clear to me how I should proceed. The tutorial states that this is the third way to use the library, but when I use the code below:
remodel = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
remodel.fit(X)
Z2 = remodel.predict(X)
where X is the list of x-axis values, it returns
ValueError: Expected 2D array, got 1D array instead
What should I add to my data in order to make it 2D?
Caveat emptor: My understanding of HMM and this lib are based on a few minutes of googling and Wikipedia. That said:
To train an HMM model, you need a number of observed samples, each of which is a vector of features. For example, in the Wikipedia example of Alice predicting the weather at Bob's house based on what he did each day, Alice gets a number of samples (what Bob tells her each day), each of which has one feature (Bob's reported activity that day). It would be entirely possible for Bob to give Alice multiple features for a given day (what he did and what his outfit was, for instance).
To learn/fit an HMM model, then, you need a series of samples, each of which is a vector of features. This is why the fit function expects a two-dimensional input. From the docs, X is expected to be "array-like, shape (n_samples, n_features)". In your case, the position of the particle is the only feature, with each observation being a sample. So your input should be an array-like of shape (n_samples, 1) (a single column). Right now, it's presumably a flat 1-D array of shape (n_samples,) (the result of something like np.array([1, 2, 3])). So just reshape:
remodel.fit(X.reshape(-1, 1))
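A minimal sketch of the whole flow, assuming X starts out as a plain 1-D series of positions:
import numpy as np
from hmmlearn import hmm
# Hypothetical 1-D series: the particle's x position over time (a random walk).
rng = np.random.RandomState(0)
X = np.cumsum(rng.randn(200))
# hmmlearn expects shape (n_samples, n_features); here there is one feature.
X = X.reshape(-1, 1)
remodel = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
remodel.fit(X)
Z2 = remodel.predict(X)  # most likely hidden state for each time step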
For me, the reshape method didn't work. I used NumPy's np.column_stack instead. I suggest you insert X = np.column_stack([X[:]]) before fitting the model; it should work out.

plotting a line in matplotlib in python [duplicate]

I am trying to plot the decision boundary of a perceptron algorithm and am really confused about a few things. My input instances are in the form [(x1, x2), target_value], i.e. a 2-d input instance and a 2-class target value (1 or 0).
My weight vector is hence of the form [w1, w2]. Now I have to incorporate an additional bias parameter w0, so does my weight vector become a 3x1 vector, or is it a 1x3 vector? I think it should be 1x3, since a vector has only 1 row and n columns.
Now let's say I instantiate [w0, w1, w2] to random values; how would I plot the decision boundary for this? What does w0 signify here? Is w0/norm(w) the distance of the decision region from the origin? If so, how do I capture this and plot it in Python using matplotlib.pyplot or its MATLAB equivalent? I would really appreciate even a little help regarding this matter.
from pylab import norm
import matplotlib.pyplot as plt
n = norm(weight_vector) #this is of the form [w0,w1,w2], w0 is bias parameter
ww = weight_vector/n #unit vector in the direction of weight_vector
ww1 = [ww[1],-ww[0]]
ww2 = [-ww[1],ww[0]]
plt.plot([ww1[0], ww2[0]], [ww1[1], ww2[1]], '--k')
plt.show()
Here I want to incorporate the w0 parameter to indicate the displacement of the decision boundary from the origin, since that's what w0/norm(w) indicates, right?
When I plot the vector as mentioned in the comments below, I get a line of really small length. How would it be possible for me to extend this decision boundary in both directions?
The small dashed line near [0, 0] in the figure is my decision region. How can I make it longer in both directions? If I try to multiply each of its components, the figure scale changes. I am using the matplotlib.pyplot.plot() function to achieve this.
First of all, you shouldn't fold the bias into the input vectors themselves. You only need to add (or subtract) the bias term in the weighted sum computed for each input vector.
For plotting, you might want to try plotting the linear function that passes through the two weight points.
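For example, here is a minimal sketch (the weights are made-up values) that plots the boundary w0 + w1*x1 + w2*x2 = 0 over a chosen range, so the line extends in both directions instead of being a tiny segment:
import numpy as np
import matplotlib.pyplot as plt
# Hypothetical weight vector [w0, w1, w2]; w0 is the bias parameter.
w0, w1, w2 = 0.5, 2.0, -1.0
# The decision boundary is the set of points where w0 + w1*x1 + w2*x2 = 0,
# i.e. x2 = -(w0 + w1*x1) / w2 (assuming w2 != 0).
x1 = np.linspace(-5, 5, 100)
x2 = -(w0 + w1 * x1) / w2
plt.plot(x1, x2, '--k')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()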

Circular Dimensionality Reduction?

I want a dimensionality reduction such that the dimensions it returns are circular.
For example, if I reduce 12-d data to 2-d, normalized between 0 and 1, then I want (0, 0) to be as close to (.9, .9) as it is to (.1, .1).
What is my algorithm? (bonus points for python implementation)
PCA gives me 2d plane of data, whereas I want spherical surface of data.
Make sense? Simple? Inherent problems? Thanks.
I think what you are asking about is all about transformation.
Circular
I want (0, 0) to be as close to (.9, .9) as it is to (.1, .1).
PCA
Taking your approach of normalization, what you could do is map the values in the interval [0.5, 1] onto [0.5, 0].
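A minimal sketch of that folding, assuming the values are already normalized to [0, 1]:
import numpy as np
# Hypothetical normalized values in [0, 1].
x = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
# Fold the upper half of the interval back: 0.9 becomes 0.1 and 1.0 becomes 0.0,
# so 0.0 is now as close to 0.9/1.0 as it is to 0.1.
folded = np.where(x > 0.5, 1.0 - x, x)
print(folded)  # [0.  0.1 0.5 0.1 0. ]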
MDS
If you want to use a distance metric, you could first compute the distances and then do the same. For instance, taking the correlation, you could use 1 - abs(corr). Since the correlation lies in [-1, 1], positive and negative correlations will give values close to zero, while uncorrelated data will give values close to one. Then, having computed the distances, you use MDS to get your projection.
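As a sketch of that, with made-up data and scikit-learn's MDS on a precomputed dissimilarity matrix:
import numpy as np
from sklearn.manifold import MDS
# Hypothetical data: 100 samples with 12 dimensions each.
rng = np.random.RandomState(0)
data = rng.randn(100, 12)
# Correlation-based distance between samples: strongly correlated samples
# (positively or negatively) end up close, uncorrelated ones end up far.
corr = np.corrcoef(data)        # (100, 100) sample-by-sample correlations
dist = 1 - np.abs(corr)
# MDS with a precomputed dissimilarity matrix projects the samples to 2-d.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
embedding = mds.fit_transform(dist)
print(embedding.shape)          # (100, 2)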
Space
PCA gives me 2d plane of data, whereas I want spherical surface of data.
Since you want a spherical surface, I think you can directly transform the 2-d plane onto a sphere. A spherical coordinate system with a constant radius would do that, wouldn't it?
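As a rough sketch of that spherical-coordinate idea (made-up 2-d coordinates; note that only the azimuthal angle actually wraps around):
import numpy as np
# Hypothetical 2-d embedding with both coordinates normalized to [0, 1].
rng = np.random.RandomState(0)
uv = rng.rand(100, 2)
# Treat the coordinates as angles on a sphere of constant radius 1.
# The azimuth wraps around, so 0 and 1 in the first coordinate meet.
theta = uv[:, 0] * 2 * np.pi    # azimuthal angle
phi = uv[:, 1] * np.pi          # polar angle
xyz = np.column_stack([
    np.sin(phi) * np.cos(theta),
    np.sin(phi) * np.sin(theta),
    np.cos(phi),
])
print(xyz.shape)                # (100, 3)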
Another question is then: Is all this a reasonable thing to do?
