I have a time series of position of a particle over time and I want to estimate model parameters of two HMM using this data (one for the x axis, the other for the y axis). I'm using the hmmlearn library, however, it is not clear to me how should I proced. In the tutorial, it states that this is the third way to use the library, however, when I use the code as bellow:
remodel = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
remodel.fit(X)
Z2 = remodel.predict(X)
and X is the list of x-axis values, it returns
ValueError: Expected 2D array, got 1D array instead
What should I add to my data in order to turn it 2D?
Caveat emptor: My understanding of HMM and this lib are based on a few minutes of googling and Wikipedia. That said:
To train an HMM model, you need a number of observes samples, each of which is a vector of features. For example, in the Wikipedia example of Alice predicting the weather at Bob's house based on what he did each day, Alice gets a number of samples (what Bob tells her each day), each of which has one feature (Bob's reported activity that day). It would be entirely possible for Bob to give Alice multiple features for a given day (what he did, and what his outfit was, for instance).
To learn/fit an HMM model, then, you should need a series of samples, each of which is a vector of features. This is why the fit function expects a two-dimensional input. From the docs, X is expected to be "array-like, shape (n_samples, n_features)". In your case, the position of the particle is the only feature, with each observation being a sample. So your input should be an array-like of shape n_samples, 1 (a single column). Right now, it's presumably of shape 1, n_samples (a single row, the default from doing something like np.array([1, 2, 3])). So just reshape:
remodel.fit(X.reshape(-1, 1))
for me, the reshape method didn't work. I used the numpy's np.column_stack instead. I suggest you insert X = np.column_stack([X[:]]) before fitting the model, it should work out.
Related
I'm attempting to do a regression to fit a function to some data points I have, these are simply put (x,y) where x = date and y = a point of data. Seems simple enough.
I'm following along on a how-to and it comes to the part where you split your data into training/testing, that much I understand, but the input for model.fit is a 2D array and then labels.
I think I'm being incredibly dense, but this is what I have for that:
model.fit(input, date_time_training)
My input is an array like so [[5, 3], [7,5], etc] my "labels" are dates because that's how I'd want to label my data but that's not right, they need to be numbers. There are two things it could be, though, my data points which are y on my graph and my x-axis which are dates. I converted my dates into numbers (0,1,2,3,etc) corresponding to each date.
Is that also what my labels would be?
Also my input is just [[date_converted_to_int, score], etc] which when looking at the documentation, seemingly that should be [[points, features], etc]. I'm pretty confused, obviously not super experienced with regression either (otherwise I'm guessing this would be clearer).
You are trying to predict {actual term is forecast in this case} your y over time.
So, It is more suitable to use a time series model in this case. Because by definition this is a time series use case.
[time series: you try to understand the evolution of values of an attribute over time]
Try some models like:
AR
ARIMA
and
statsmodel would be a nice place to visit by for documentation
the sklearn.datasets has a function called make_s_curve that returns a dataset that represent a 3D S shape looks like this:
The function returns the dataset with is of shape [num_samples, 3]. It also returns a second output which is of shape [num_samples] and explained that it is
the univariate position of the sample according to the main dimension
of the points in the manifold.
I don't understand what this mean, is this some particular ordering of the points in the dataset ?
Really appreciate any help.
the univariate position
This means that the postion will not change
according to the main dimension of the points
And the order of the position is based on the main dimension.
in the manifold
and manifold is a topological space.
The idea behind that is that you can compare the plot, and the points always have the same order in respect to their dimensions.
I'm trying to plot a 3-feature dataset with a binary classification on a matplotlib plot. This worked with an example dataset provided in a guide (http://www.apnorton.com/blog/2016/12/19/Visualizing-Multidimensional-Data-in-Python/) but when I try to instead insert my own dataset, the LinearDiscriminantAnalysis will only output a one-dimensional series, no matter what number I put in "n_components". Why would this not work with my own code?
Data = pd.read_csv("DataFrame.csv", sep=";")
x = Data.iloc[:, [3, 5, 7]]
y = Data.iloc[:, 8]
lda = LDA(n_components=2)
lda_transformed = pd.DataFrame(lda.fit_transform(x, y))
plt.scatter(lda_transformed[y==0][0], lda_transformed[y==0][1], label='Loss', c='red')
plt.scatter(lda_transformed[y==1][0], lda_transformed[y==1][1], label='Win', c='blue')
plt.legend()
plt.show()
In the case when the number of different class labels, C, is less than the number of observations (almost always), then linear discriminant analysis will always produce C - 1 discriminating components. Using n_components from the sklearn API is only a means to choose possibly fewer components, e.g. in the case when you know what dimensionality you'd like to reduce down to. But you could never use n_components to get more components.
This is discussed in the Wikipedia section on Multiclass LDA. The definition of the between-class scatter is given as
\Sigma_{b} = (1 / C) \sum_{i}^{C}( (\mu_{i} - mu)(\mu_{i} - mu)^{T}
which is the empirical covariance matrix among the population of class means. By definition, such a covariance matrix has rank at most C - 1.
... the variability between features will be contained in the subspace spanned by the eigenvectors corresponding to the C − 1 largest eigenvalues ...
So because LDA uses a decomposition of the class mean covariance matrix, it means the dimensionality reduction it can provide is based on the number of class labels, and not on the sample size nor the feature dimensionality.
In the example you linked, it doesn't matter how many features there are. The point is that the example uses 3 simulated cluster centers, so there are 3 class labels. This means linear discriminant analysis could produce projection of the data onto either 1-dimensional or 2-dimensional discriminating subspaces.
But in your data, you start out with only 2 class labels, a binary problem. This means the dimensionality of the linear discriminant model can be at most 1-dimensional, literally a line that forms the decision boundary between the two classes. Dimensionality reduction with LDA in this case would simply be the projection of data points onto a particular normal vector of that separating line.
If you want to specifically reduce down to two dimensions, you can try many of the other algorithms that sklearn provides: t-SNE, ISOMAP, PCA and kernel PCA, random projection, multi-dimensional scaling, among others. Many of these allow you to choose the dimensionality of the projected space, up to the original feature dimensionality, or sometimes you can even project into larger spaces, like with kernel PCA.
In the example you give, dimension reduction by LDA reduces the data from 13 features to 2 features, however in your example it reduces from 3 to 1 (even though you wanted to get 2 features), thus it is not possible to plot in 2D.
If you really want to select 2 features out of 3, you can use feature_selection.SelectKBest to choose 2 best features and there won't be any problems plotting in 2D.
For more information, please read this fantastic answer for PCA:
https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
Probably it's because of sklearn implementation that won't allowed you to do so if you only have 2 class. The problem has been stated in here, https://github.com/scikit-learn/scikit-learn/issues/1967.
so I'm currently working on a project that involves the use of Principal Component Analysis, or PCA, and I'm attempting to kind of learn it on the fly. Luckily, Python has a very convenient module from scikitlearn.decomposition that seems to do most of the work for you. Before I really start to use it though, I'm trying to figure out exactly what it's doing.
The dataframe I've been testing on looks like this:
0 1
0 1 2
1 3 1
2 4 6
3 5 3
And when I call PCA.fit() and then view the components I get:
array([[ 0.5172843 , 0.85581362],
[ 0.85581362, -0.5172843 ]])
From my rather limited knowledge of PCA, I kind of grasp how this was calculated, but where I get lost is when I then call PCA.transform. This is the output it gives me:
array([[-2.0197033 , -1.40829634],
[-1.84094831, 0.8206152 ],
[ 2.95540408, -0.9099927 ],
[ 0.90524753, 1.49767383]])
Could someone potentially walk me through how it takes the original dataframe and components and transforms it into this new array? I'd like to be able to understand the exact calculations it's doing so that when I scale up I'll have a better sense of what's going on. Thanks!
When you call fit PCA is going to compute some vectors that you can project your data onto in order to reduce the dimension of your data. Since each row of your data is 2 dimensional there will be a maximum of 2 vectors onto which data can be projected and each of those vectors will be 2-dimensional. Each row of PCA.components_ is a single vector onto which things get projected and it will have the same size as the number of columns in your training data. Since you did a full PCA you get 2 such vectors so you get a 2x2 matrix. The first of those vectors will maximize the variance of the projected data. The 2nd will maximize the variance of what's left after the first projection. Typically one passed a value of n_components that's less than the dimension of the input data so that you get back fewer rows and you have a wide but not tall components_ array.
When you call transform you're asking sklearn to actually do the projection. That is, you are asking it to project each row of your data into the vector space that was learned when fit was called. For each row of the data you pass to transform you'll have 1 row in the output and the number of columns in that row will be the number of vectors that were learned in the fit phase. In other words, the number of columns will be equal to the value of n_components you passed to the constructor.
Typically one uses PCA when the source data has lots of columns and you want to reduce the number of columns while preserving as much information as possible. Suppose you had a data set with 100 rows and each row had 500 columns. If you constructed a PCA like PCA(n_components = 10) and then called fit you'd find that components_ has 10 rows, one for each of the components you requested, and 500 columns as that's the input dimension. If you then called transform all 100 rows of your data would be projected into this 10-dimensional space so the output would have 100 rows (1 for each in the input) but only 10 columns thus reducing the dimension of your data.
The short answer to how this is done is that PCA computes a Singular Value Decomposition and then keeps only some of the columns of one of those matrices. Wikipedia has much more information on the actual linear algebra behind this - it's a bit long for a StackOverflow answer.
I am trying to do dimensionality reduction using PCA function of sklearn, specifically
from sklearn.decomposition import PCA
def mypca(X,comp):
pca = PCA(n_components=comp)
pca.fit(X)
PCA(copy=True, n_components=comp, whiten=False)
Xpca = pca.fit_transform(X)
return Xpca
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
I am calling mypca function from a loop with different values for comp. I am doing this in order to find the best value of comp for the problem I am trying to solve. But mypca function always returns the same value i.e. Xpca irrespective of value of comp.
The value it returns is correct for first value of comp I send from the loop i.e. Xpca value which it sends each time is correct for comp = 10 in my case.
What should I do in order to find best value of comp?
You use PCA to reduce the dimension.
From your code:
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
Your input dataset X is only a 2 dimensional array, the minimum n_comp is 10, so the PCA try to find the 10 best dimension for you. Since 10 > 2, you will always get the same answer. :)
It looks like you're trying to pass different values for number of components, and re-fit with each. A great thing about PCA is that it's actually not necessary to do this. You can fit the full number of components (even as many components as dimensions in your dataset), then simply discard the components you don't want (i.e. those with small variance). This is equivalent to re-fitting the entire model with fewer components. Saves a lot of computation.
How to do it:
# x = input data, size(<points>, <dimensions>)
# fit the full model
max_components = x.shape[1] # as many components as input dimensions
pca = PCA(n_components=max_components)
pca.fit(x)
# transform the data (contains all components)
y_all = pca.transform(x)
# keep only the top k components (with greatest variance)
k = 2
y = y_all[:, 0:k]
In terms of how to select the number of components, it depends what you want to do. One standard way of choosing the number of components k is to look at the fraction of variance explained (R^2) by each choice of k. If your data is distributed near a low-dimensional linear subspace, then when you plot R^2 vs. k, the curve will have an 'elbow' shape. The elbow will be located at the dimensionality of the subspace. It's good practice to look at this curve because it helps understand the data. Even if there's no clean elbow, it's common to choose a threshold value for R^2, e.g. to preserve 95% of the variance.
Here's how to do it (this should be done on the model with max_components components):
# Calculate fraction of variance explained
# for each choice of number of components
r2 = pca.explained_variance_.cumsum() / x.var(0).sum()
Another way you might want to proceed is to take the PCA-transformed data and feed it to a downstream algorithm (e.g. classifier/regression), then select your number of components based on the performance (e.g. using cross validation).
Side note: Maybe just a formatting issue, but your code block in mypca() should be indented, or it won't be interpreted as part of the function.