Python Gaussian Process Regression with sklearn

I'm attempting to do a regression to fit a function to some data points I have; simply put, these are (x, y) pairs where x = date and y = a data point. Seems simple enough.
I'm following along with a how-to, and it comes to the part where you split your data into training/testing. That much I understand, but the input for model.fit is a 2D array plus labels.
I think I'm being incredibly dense, but this is what I have for that:
model.fit(input, date_time_training)
My input is an array like so: [[5, 3], [7, 5], etc.]. My "labels" are dates, because that's how I'd want to label my data, but that's not right; they need to be numbers. There are two things they could be, though: my data points, which are y on my graph, or my x-axis, which is dates. I converted my dates into numbers (0, 1, 2, 3, etc.) corresponding to each date.
Is that also what my labels would be?
Also, my input is just [[date_converted_to_int, score], etc.], which, looking at the documentation, should seemingly be [[points, features], etc.]. I'm pretty confused, and obviously not super experienced with regression either (otherwise I'm guessing this would be clearer).

You are trying to predict (the actual term is forecast in this case) your y over time.
So it is more suitable to use a time series model in this case, because by definition this is a time series use case.
[time series: you try to understand the evolution of an attribute's values over time]
Try some models like:
AR
ARIMA
The statsmodels library would be a nice place to visit for documentation.
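For concreteness, here is a minimal sketch of fitting an ARIMA model with statsmodels (not from the original answer; the synthetic series and the (1, 1, 1) order are placeholder assumptions):
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Stand-in for the asker's data: one score per day, indexed by date
dates = pd.date_range("2023-01-01", periods=30, freq="D")
scores = pd.Series(range(30), index=dates, dtype=float)

model = ARIMA(scores, order=(1, 1, 1))  # (p, d, q) chosen arbitrarily here
fitted = model.fit()
print(fitted.forecast(steps=7))         # forecast the next 7 days
The point is that a time series model takes the series itself, indexed by date, rather than a separate labels array.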

Related

Scipy Detrend in python

I detrended my data in Python using the following code with scipy.signal.detrend:
import numpy as np
from scipy import signal

detrended = signal.detrend(feature, axis=-1, type='constant', bp=0, overwrite_data=True)
np.savetxt('constant detrend.csv', detrended, delimiter=',', fmt='%s')
The last line saves the data into a CSV file; then I reload this data to run some models. I found that my RandomForest model performs really well with the detrended dataset.
So the next step is to make forecasts using this model. However, I am a bit unsure how I can move from the detrended dataset back to a more meaningful dataset that I can understand. From my understanding, the detrend removed the mean and normalized the data. But when I make my predictions, I need to be able to see the actual numbers of my forecasts, not the detrended numbers.
Is there a way I can re-add the mean and renormalize to get a 'meaningful dataset' that I can interpret? For example, my dataset has a rainfall variable, so for each month I can see how much it rained. But once I detrended, the rainfall value is no longer the actual rainfall value. When I make forecasts I want to be able to say that in this month it rained 200mm, but my forecasts don't tell me this since the data has been detrended.
Any help would be appreciated.
According to the docs, detrend simply removes the least squares line fit from the data. When you use type='constant', it's even simpler, since it just removes the mean:
If type == 'constant', only the mean of data is subtracted.
The source code bears this out. After checking the inputs, the entire computation is done in one line (scipy/signal/signaltools.py, line 3261):
ret = data - np.expand_dims(np.mean(data, axis), axis)
The easiest way to get the subtracted mean is to implement the calculation by hand, given how simple it is.
mean = np.mean(feature, axis=-1, keepdims=True)
detrended = feature - mean
You can save the mean to a file, or do whatever else you want with it. To "retrend", just add the mean back:
point = prediction + mean
If you had some other manipulation you were concerned with, like normalizing to the maximum, you could handle it the same way.
max = np.amax(detrended, axis=-1, keepdims=True)
detrended /= max
In this case you'd have to multiply before offsetting to retrend:
point = prediction * max + mean
Simple manipulations like this are easy to reproduce by hand. A more complicated function might be hard to reproduce reliably, but would also be more likely to return the parameters it uses, at least optionally.
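Putting the pieces above together, a self-contained sketch of the detrend/retrend round trip (the array contents and the names feature and prediction are placeholders):
import numpy as np

feature = np.array([[1.0, 2.0, 3.0],
                    [10.0, 20.0, 30.0]])   # stand-in data

# Detrend by hand so the statistics are kept around
mean = np.mean(feature, axis=-1, keepdims=True)
detrended = feature - mean
peak = np.amax(detrended, axis=-1, keepdims=True)
detrended = detrended / peak

# ... train a model on `detrended` and obtain `prediction` ...
prediction = detrended        # placeholder for the model's output

# Retrend: undo the operations in reverse order
restored = prediction * peak + mean
print(np.allclose(restored, feature))   # True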

Hidden Markov Model python

I have a time series of the position of a particle over time, and I want to estimate the model parameters of two HMMs using this data (one for the x axis, the other for the y axis). I'm using the hmmlearn library; however, it is not clear to me how I should proceed. In the tutorial, it states that this is the third way to use the library; however, when I use the code as below:
from hmmlearn import hmm

remodel = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
remodel.fit(X)
Z2 = remodel.predict(X)
and X is the list of x-axis values, it returns
ValueError: Expected 2D array, got 1D array instead
What should I add to my data in order to turn it 2D?
Caveat emptor: My understanding of HMMs and this lib is based on a few minutes of googling and Wikipedia. That said:
To train an HMM model, you need a number of observes samples, each of which is a vector of features. For example, in the Wikipedia example of Alice predicting the weather at Bob's house based on what he did each day, Alice gets a number of samples (what Bob tells her each day), each of which has one feature (Bob's reported activity that day). It would be entirely possible for Bob to give Alice multiple features for a given day (what he did, and what his outfit was, for instance).
To learn/fit an HMM model, then, you should need a series of samples, each of which is a vector of features. This is why the fit function expects a two-dimensional input. From the docs, X is expected to be "array-like, shape (n_samples, n_features)". In your case, the position of the particle is the only feature, with each observation being a sample. So your input should be an array-like of shape n_samples, 1 (a single column). Right now, it's presumably of shape 1, n_samples (a single row, the default from doing something like np.array([1, 2, 3])). So just reshape:
remodel.fit(X.reshape(-1, 1))
For me, the reshape method didn't work. I used numpy's np.column_stack instead. I suggest you insert X = np.column_stack([X[:]]) before fitting the model; it should work out.
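To make the shapes concrete, here is a small sketch with synthetic data (not the asker's particle positions); both reshape and np.column_stack produce the (n_samples, 1) array the library expects:
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
X = rng.standard_normal(200)              # 1-D series, shape (200,)

X_col = X.reshape(-1, 1)                  # shape (200, 1): one feature per sample
assert X_col.shape == np.column_stack([X]).shape

remodel = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
remodel.fit(X_col)
Z2 = remodel.predict(X_col)               # one hidden state per sample
print(Z2.shape)                           # (200,)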

How to correctly translate Kmeans labels to category labels

I have been using Sklearn's Kmeans implementation
I have been clustering a dataset which is labeled, and I have been using sklearn's clustering metrics in order to test the clustering performance.
Sklearn's KMeans clustering output is, as you know, a list of numbers in the range of k_clusters. However, my labels are strings.
So far I've had no problems with them, since the metrics from sklearn.metrics.cluster work with mixed inputs (int & str label lists).
However, now I want to use some of the classification metrics, and from what I gather, the inputs k_true and k_pred need to come from the same set: either numbers in the range of k, or the string labels that my dataset is using. If I try it anyway, it returns the following error:
AttributeError: 'bool' object has no attribute 'sum'
So, how could I translate the KMeans labels into another type of label? Or even the other way around (string labels -> integer labels)?
How could I even begin implementing this? Since KMeans is pretty non-deterministic, the labels might change from iteration to iteration. Is there a legitimate way to correctly translate the KMeans labels?
EDIT:
EXAMPLE
for k = 4
kmeans output: [0,3,3,2,........0]
class labels : ['CAT','DOG','DOG','BIRD',.......'CHICKEN']
Clustering is not classification.
The methods do not predict a label, so you must not use a classification evaluation measure. That would be like measuring the quality of an apple in miles per gallon...
If you insist on doing the wrong thing(tm), then use the Hungarian algorithm to find the best mapping. But beware: the number of clusters and the number of classes will usually not be the same. If this is the case, using such a mapping will either be unfairly negative (extra clusters are not mapped) or unfairly positive (mapping multiple clusters to the same label will consider the "N points in N clusters" solution optimal). It's better to only use clustering measures.
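If you do go down that road, here is a sketch of the Hungarian-algorithm mapping using scipy's linear_sum_assignment (the toy labels below are made up, and it assumes as many clusters as classes, i.e. a square contingency matrix):
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

y_true = np.array(['CAT', 'DOG', 'DOG', 'BIRD', 'CAT', 'BIRD'])   # toy data
y_kmeans = np.array([0, 3, 3, 2, 0, 2])

C = contingency_matrix(y_true, y_kmeans)       # rows: classes, columns: clusters
row_ind, col_ind = linear_sum_assignment(-C)   # maximize the matched counts
classes = np.unique(y_true)
clusters = np.unique(y_kmeans)
mapping = {clusters[c]: classes[r] for r, c in zip(row_ind, col_ind)}
print(mapping)                                 # e.g. {0: 'CAT', 2: 'BIRD', 3: 'DOG'}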
You can create a mapping using a dictionary, say
mapping_dict = { 0: 'cat', 1: 'chicken', 2:'bird', 3:'dog'}
Then you can simply apply this mapping using, say, a list comprehension, etc.
Suppose your labels are stored in a list kmeans_predictions
mapped_predictions = [ mapping_dict[x] for x in kmeans_predictions]
Then use mapped_predictions as your predictions
Update: Based on your comments, I believe you have to do it the other way round, i.e. convert your labels into int mappings.
Also, you cannot use just any classification metric here. Use the completeness score, v-measure, and homogeneity, as these are better suited for clustering problems. It would be incorrect to just blindly use any random classification metric here.
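For those clustering-oriented metrics no mapping is needed at all, since they accept mixed label types; a quick sketch with made-up labels:
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

y_true = ['CAT', 'DOG', 'DOG', 'BIRD', 'CAT', 'BIRD']   # toy data
y_kmeans = [0, 3, 3, 2, 0, 2]

print(homogeneity_score(y_true, y_kmeans))    # 1.0 for this toy example
print(completeness_score(y_true, y_kmeans))   # 1.0
print(v_measure_score(y_true, y_kmeans))      # 1.0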

Confusion matrix for Clustering in scikit-learn

I have a set of data with known labels. I want to try clustering and see if I can get the same clusters given by known labels. To measure the accuracy, I need to get something like a confusion matrix.
I know I can get a confusion matrix easily for a test set of a classification problem. I already tried that like this.
However, it can't be used for clustering, as it expects both columns and rows to have the same set of labels, which makes sense for a classification problem. But for a clustering problem, what I expect is something like this:
Rows - Actual labels
Columns - New cluster names (i.e. cluster-1, cluster-2 etc.)
Is there a way to do this?
Edit: Here are more details.
In sklearn.metrics.confusion_matrix, it expects y_test and y_pred to have the same values, and labels to be the labels of those values.
That's why it gives a matrix which has the same labels for both rows and columns like this.
But in my case (KMeans clustering), the real values are strings and the estimated values are numbers (i.e. cluster numbers).
Therefore, if I call confusion_matrix(y_true, y_pred) it gives below error.
ValueError: Mix of label input types (string and number)
This is the real problem. For a classification problem, this makes sense. But for a clustering problem, this restriction shouldn't be there, because real label names and new cluster names don't need to be the same.
With this, I understand I'm trying to use a tool which is meant for classification problems for a clustering problem. So, my question is: is there a way I can get such a matrix for my clustered data?
Hope the question is now clearer. Please let me know if it isn't.
I wrote the code myself:
# Compute confusion matrix
def confusion_matrix(act_labels, pred_labels):
    uniqueLabels = list(set(act_labels))
    clusters = list(set(pred_labels))
    cm = [[0 for i in range(len(clusters))] for i in range(len(uniqueLabels))]
    for i, act_label in enumerate(uniqueLabels):
        for j, pred_label in enumerate(pred_labels):
            if act_labels[j] == act_label:
                cm[i][pred_label] = cm[i][pred_label] + 1
    return cm

# Example
labels = ['a', 'b', 'c',
          'a', 'b', 'c',
          'a', 'b', 'c',
          'a', 'b', 'c']
pred = [1, 1, 2,
        0, 1, 2,
        1, 1, 1,
        0, 1, 2]
cnf_matrix = confusion_matrix(labels, pred)
print('\n'.join([''.join(['{:4}'.format(item) for item in row])
                 for row in cnf_matrix]))
Edit:
(Dayyyuumm) just found that I could do this easily with Pandas Crosstab :-/.
import pandas as pd

labels = ['a', 'b', 'c',
          'a', 'b', 'c',
          'a', 'b', 'c',
          'a', 'b', 'c']
pred = [1, 1, 2,
        0, 1, 2,
        1, 1, 1,
        0, 1, 2]
# Create a DataFrame with labels and clusters as columns: df
df = pd.DataFrame({'Labels': labels, 'Clusters': pred})
# Create crosstab: ct
ct = pd.crosstab(df['Labels'], df['Clusters'])
# Display ct
print(ct)
You can easily compute a pairwise intersection matrix.
But it may be necessary to do this yourself, since the sklearn function is designed for the classification use case.
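A sketch of computing such an intersection matrix by hand with numpy (the label values below are illustrative):
import numpy as np

y_true = np.array(['a', 'b', 'c', 'a', 'b', 'c'])   # actual labels
y_pred = np.array([1, 1, 2, 0, 1, 2])               # cluster assignments

classes = np.unique(y_true)
clusters = np.unique(y_pred)
matrix = np.zeros((len(classes), len(clusters)), dtype=int)
for i, c in enumerate(classes):
    for j, k in enumerate(clusters):
        matrix[i, j] = np.sum((y_true == c) & (y_pred == k))
print(matrix)   # rows: actual labels, columns: clusters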

Pipeline with PolynomialFeatures and LinearRegression - unexpected result

With the following code I just want to fit a regression curve to some sample data, but it is not working as expected.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = 10*np.random.rand(100)
y = 2*X**2 + 3*X - 5 + 3*np.random.rand(100)
xfit = np.linspace(0, 10, 100)
poly_model = make_pipeline(PolynomialFeatures(2), LinearRegression())
poly_model.fit(X[:, np.newaxis], y)
y_pred = poly_model.predict(X[:, np.newaxis])
plt.scatter(X, y)
plt.plot(X[:, np.newaxis], y_pred, color="red")
plt.show()
Shouldn't there be a curve which fits the data points perfectly? After all, the training data (X[:, np.newaxis]) and the data used to predict y_pred are the same (also X[:, np.newaxis]).
If I instead use the xfit data for the prediction, the result is as desired...
...
y_pred=poly_model.predict(xfit[:,np.newaxis])
plt.scatter(X,y)
plt.plot(xfit[:,np.newaxis],y_pred,color="red")
plt.show()
So what's the issue, and what's the explanation for such behaviour?
The difference between the two plots is that in the line
plt.plot(X[:,np.newaxis],y_pred,color="red")
the values in X[:,np.newaxis] are not sorted, while in
plt.plot(xfit[:,np.newaxis],y_pred,color="red")
the values of xfit[:,np.newaxis] are sorted.
Now, plt.plot connects any two consecutive values in the array by line, and since they are not sorted you get this bunch of lines in your first figure.
Replace
plt.plot(X[:,np.newaxis],y_pred,color="red")
with
plt.scatter(X[:,np.newaxis],y_pred,color="red")
and you'll get this nice looking figure:
Based on the answer of Miriam Farber, I have figured out another way. Since the X values are not sorted, I can fix the issue by simply sorting the values with:
X = np.sort(X)
Now the rest of the code can stay unchanged and will deliver the desired result.
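A related alternative (my own sketch, not from the answers above): sort only for plotting with np.argsort, so the original pairing of X, y, and y_pred stays untouched. This continues from the question's variables.
order = np.argsort(X)                    # indices that sort X ascending
plt.scatter(X, y)
plt.plot(X[order], y_pred[order], color="red")
plt.show()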
