Error in model prediction using hmmlearn - python

Hi, I have a dataframe test and I am trying to predict from it using a Gaussian HMM from hmmlearn.
When I do this:
y = model.predict(test)
y
the HMM works fine and produces an array of states.
However, if I do this:
for i in range(0, len(test)):
    y = model.predict(test[:i])
all I get is y being set to 1.
Can anyone help?
UPDATE
Here is the code that does work, iterating through the data.
The training set was rows 0-249:
for i in range(251, len(X)):
    test = X[:i]
    y = model.predict(test)
    print(y[len(y)-1])

An HMM models sequences of observations. If you feed a single observation into predict (which does Viterbi decoding by default), you essentially reduce the prediction to
(model.startprob_ * model.predict_proba(test[i:i + 1])).argmax()
which can be dominated by startprob_, e.g. if startprob_ = [10**-8, 1 - 10**-8]. This could explain the all-ones behaviour you're seeing.
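For what it is worth, here is a minimal sketch contrasting the two decoding strategies. The 2-state Gaussian HMM and all of its parameters are made up for illustration (following hmmlearn's documented pattern of setting parameters by hand), so treat the numbers as placeholders:
import numpy as np
from hmmlearn import hmm

# Illustrative 2-state Gaussian HMM with a very skewed startprob_.
model = hmm.GaussianHMM(n_components=2, covariance_type="full")
model.startprob_ = np.array([1e-8, 1 - 1e-8])
model.transmat_ = np.array([[0.9, 0.1],
                            [0.1, 0.9]])
model.means_ = np.array([[0.0], [5.0]])
model.covars_ = np.tile(np.identity(1), (2, 1, 1))

X = np.concatenate([np.random.normal(0, 1, (50, 1)),
                    np.random.normal(5, 1, (50, 1))])

# Decoding one observation at a time: startprob_ dominates, so most calls
# come back as state 1 regardless of the observation.
single = [model.predict(X[i:i + 1])[0] for i in range(len(X))]

# Decoding a growing prefix and keeping its last state also uses the
# transition structure, which is why the code in your UPDATE behaves sensibly.
prefix = [model.predict(X[:i])[-1] for i in range(1, len(X) + 1)]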

Related

How can I predict the following value of time series using Tensorflow predict method?

I am wondering how to predict future time series values after training a model. I would like to get the values N steps ahead, and I am not sure whether the time series has been learned and predicted properly. How do I do this correctly to get the following (next) values using model.predict or something similar?
I have x_test with x_test[-1] == t, so by the next values I mean t+1, t+2, ..., t+n; that is what I want to get in this example.
First
I tried it with stock index data:
inputs = total_data[len(total_data) - forecast - look_back:]
inputs = scaler.transform(inputs)
X_test = []
for i in range(look_back, inputs.shape[0]):
    X_test.append(inputs[i - look_back:i])
X_test = np.array(X_test)
predicted = model.predict(X_test)
but the result is like below:
The predictions for X_test[-20:] and the following 20 predicted values look almost the same. I am wondering whether this is the correct way to train and predict, and whether the result is correct.
full source
The method I tried first did not work correctly.
Second
I realized something was wrong, so I tried another, official dataset: I used the time series from the TensorFlow tutorial to practice training the model.
a = y_val[-look_back:]
for i in range(n_steps):  # predict a new value n times
    tmp = model.predict(a.reshape(-1, look_back, num_feature))  # predicted value
    a = a[1:]              # remove the first (oldest) value
    a = np.append(a, tmp)  # append the predicted value
The predicted results came out in a nearly straight, linear shape, very different from the real data.
The output is an abnormal linear trend that is independent of the real data:
full source (my code starts after line 25.)
I am really curious how to predict the following values of a time series using the TensorFlow predict method.
I am not asking whether this works theoretically; I just want to know how to get the following n steps using the predict method.
Thank you for reading this long question. Any advice would be much appreciated.
"lstm" is normaly used to predict 3D data => target
which input has same time frame number(n , t ,f)
"n" for data number "t" for frame number , "f" for feature number
what you want to predict is 1D data => target
which is (t ,f )
when you have f = 0 then you can only use F(t) => y
if you can estimate function F then u can get y . NN can't help here
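To make the shape remark concrete, here is a minimal sketch of turning a 1D series into the (n, t, f) windows an LSTM expects and predicting one step ahead. The toy series, layer sizes and training settings are all made up for illustration; only the (n, t, f) reshaping is the point:
import numpy as np
import tensorflow as tf

look_back, num_feature = 20, 1          # t and f in the (n, t, f) notation above

# Toy 1D series standing in for the real data.
series = np.sin(np.linspace(0, 20, 500)).astype("float32")

# Build (n, t, f) windows: each sample is look_back consecutive values,
# and the target is the value that follows the window.
X = np.stack([series[i:i + look_back] for i in range(len(series) - look_back)])
y = series[look_back:]
X = X.reshape(-1, look_back, num_feature)    # shape (n, t, f)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(look_back, num_feature)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, verbose=0)

# One-step prediction for the most recent window; a multi-step forecast would
# feed each prediction back into the window, as in the loop from the question.
next_value = model.predict(series[-look_back:].reshape(1, look_back, num_feature),
                           verbose=0)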

Why does AdaBoost not work with DecisionTree?

I'm using sklearn 0.19.1 with DecisionTree and AdaBoost.
I have a DecisionTree classifier that works fine:
from sklearn import tree

clf = tree.DecisionTreeClassifier()
train_split_perc = 10000
test_split_perc = pdf.shape[0] - train_split_perc
train_pdf_x = pdf[:train_split_perc]
train_pdf_y = YY[:train_split_perc]
test_pdf_x = pdf[-test_split_perc:]
test_pdf_y = YY[-test_split_perc:]
clf.fit(train_pdf_x, train_pdf_y)
pred2 = clf.predict(test_pdf_x)
But when trying to add AdaBoost, it throws an error on the predict function:
from sklearn.ensemble import AdaBoostClassifier

treeclf = tree.DecisionTreeClassifier(max_depth=3)
adaclf = AdaBoostClassifier(base_estimator=treeclf, n_estimators=500, learning_rate=0.5)
train_split_perc = 10000
test_split_perc = pdf.shape[0] - train_split_perc
train_pdf_x = pdf[:train_split_perc]
train_pdf_y = YY[:train_split_perc]
test_pdf_x = pdf[-test_split_perc:]
test_pdf_y = YY[-test_split_perc:]
adaclf.fit(train_pdf_x, train_pdf_y)
pred2 = adaclf.predict(test_pdf_x)
Specifically the error says:
ValueError: bad input shape (236821, 6)
The dataset that it seems to be pointing to is train_pdf_y because it has a shape of (236821, 6) and I don't understand why.
Even from the description of AdaBoostClassifier in the docs, I can understand that the actual classifier that uses the data is the DecisionTree:
An AdaBoost [1] classifier is a meta-estimator that begins by fitting
a classifier on the original dataset and then fits additional copies
of the classifier on the same dataset but where the weights of
incorrectly classified instances are adjusted such that subsequent
classifiers focus more on difficult cases.
But still I'm getting this error.
I have followed the code examples I found, even the ones on sklearn's website showing how to use AdaBoost, and I cannot understand what I am doing wrong.
Any help is appreciated.
It looks like you are trying to perform a multi-output classification problem, given the shape of y; otherwise it does not make sense that you are feeding an n-dimensional y to adaclf.fit(train_pdf_x, train_pdf_y).
So assuming that is the case, the problem is that Scikit-Learn's DecisionTreeClassifier does indeed support multi-output problems, that is, y inputs with shape [n_samples, n_outputs]. However, that is not the case for AdaBoostClassifier, given that, from the documentation, the labels must be:
y : array-like of shape = [n_samples]
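If multi-output classification really is what you want, one workaround (a sketch only, not part of the original answer; the toy data merely mimics a y of shape (n_samples, 6)) is to wrap the boosted tree in sklearn's MultiOutputClassifier, which fits one AdaBoost model per output column:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.multioutput import MultiOutputClassifier

# Toy data standing in for a y of shape (n_samples, 6).
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
Y = (rng.rand(200, 6) > 0.5).astype(int)

# base_estimator matches sklearn 0.19.1; recent versions call it `estimator`.
base = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=50,
    learning_rate=0.5,
)
# One independent AdaBoost classifier is fitted per target column.
clf = MultiOutputClassifier(base)
clf.fit(X, Y)
pred = clf.predict(X)    # shape (n_samples, 6)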

Sklearn SVR gives wrong results when the training data obviously shows a pattern

I have the following training data:
x = [
[0.914728682,5.217,5,0.217,3.150362319,33.36,35,-1.64,4.220113852],
[0.885057471,7.793,8,-0.207,3.380911063,46.84,48,-1.16,4.448243115],
[0.871345029,7.152,7,0.152,3.976205037,44.98,47,-2.02,5.421236592],
[0.821428571,8.04,8,0.04,2.909880565,52.02,54.5,-2.48,2.824104235],
[0.931372549,8.01,8,0.01,4.616714697,48.04,48,0.04,9.650462033],
[0.66367713,5.424,5.5,-0.076,1.37804878,32.6,35.5,-2.9,1.189781022],
[0.78,8.66,9,-0.34,2.272965879,48.47,55,-6.53,2.564550265],
[0.227272727,19.55,21,-1.45,1.860133206,128.23,147,-18.77,1.896893491],
[0.47826087,10.09,8,2.09,1.155519927,74.43,64,10.43,1.169547454],
[0.652694611,6.775,4,2.775,1.05529595,43.1,30,13.1,1.062885327],
[0.798561151,3.986,2,1.986,0.656563993,25.38,13,12.38,0.652442159],
[0.666666667,5.419,3,2.419,1.057985162,34.37,16,18.37,0.981719509],
[0.5625,7.719,2,5.719,0.6421797,46.91,12,34.91,0.665673336]
]
and the following labels (scores):
y = [0.237113402,0.168831169,0.104166667,0.086419753,0.063147368,0.016042781,
0.014814815,0,0,-0.0794,-0.14,-0.1832,-0.2385]
It seems clear that the larger the values in columns 5 and 9 are, the higher the score.
I wrote the following code that makes use of SVR on the training data provided:
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

rb = RobustScaler()
xScaled = rb.fit_transform(x)
model = SVR(C=1.0, epsilon=0.1)
model.fit(xScaled, y)
But no matter which of the following I use for prediction, it does not give a score that looks right:
1) score = model.predict(rb.fit_transform(testData))
2) score = model.predict(testData)
If I do something like the following during training:
from sklearn import preprocessing
xScaled = preprocessing.scale(x)
model = SVR(C=1.0, epsilon=0.1)
model.fit(xScaled, y)
then:
score = model.predict(testData)
I get back something close to the original y.
But if I pick a row from x, put it in a 2D array with one row called testData, and do:
score = model.predict(testData)
I get a wrong score. In fact, no matter which row of x I use to create the one-row 2D array, I get the same score.
What have I done wrong? I would be extremely grateful if someone can help.
1) score = model.predict(rb.fit_transform(testData))
When you do the above, you are re-fitting the RobustScaler to the new data. That means the test data is scaled according to its own statistics, which do not match the scales learned from the training data, so the results will not be good.
2) score = model.predict(testData)
In the above, you are not scaling the test data at all, so it is on a different scale from what the SVR has learnt. Hence the results will be bad here as well.
What you need to do:
score = model.predict(rb.transform(testData))
Calling transform() scales the supplied data using the statistics learned from the training data, so the SVR can predict the output properly.
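A sketch of the full pattern, using the variable names from the question (x, y and testData are assumed to be defined as above). The Pipeline variant is an alternative, not something from the original answer; it simply keeps the scaler and the model together so the fitted scaler is applied automatically at predict time:
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline

# Option 1: scale explicitly. Fit the scaler on the training data only,
# then reuse the same fitted scaler for any test data.
rb = RobustScaler()
xScaled = rb.fit_transform(x)                    # fit + transform on training data
model = SVR(C=1.0, epsilon=0.1)
model.fit(xScaled, y)
score = model.predict(rb.transform(testData))    # transform only, no re-fitting

# Option 2: a Pipeline applies the fitted scaler automatically at predict time.
pipe = make_pipeline(RobustScaler(), SVR(C=1.0, epsilon=0.1))
pipe.fit(x, y)
score = pipe.predict(testData)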

Is it possible to use a different data set as input for prediction in AdaBoostRegressor (sklearn)?

I am just a beginner in machine learning and I am playing with sklearn now.
I copied the AdaBoostRegressor example from the official site here and added the following:
X_pred = np.linspace(6, 12, 100)[:, np.newaxis]
y_pred = regr_2.predict(X_pred)
As the training data set X ranges from 0 to 6, I am trying to get a prediction for a different data set X_pred that ranges from 6 to 12.
However, I found that the value of y_pred is always -1.05382839, which is the last value of the training set output y.
I am wondering whether it is possible to use a data set outside the training samples as the input for prediction.
Is it possible to do that? If so, what is the correct usage?
BTW, the attached pic shows the output.
Red and green are the predicted outputs based on the training set input (0-6), and blue is the output for X_pred (6-12).
In short - no. This is not what regression is about. Regression is about interpolation, not extrapolation. Pretty much none of the standard regressors can make meaningful predictions for data outside the range of the training set; tree-based ensembles like AdaBoostRegressor in particular can only output values derived from the training targets, which is why y_pred is stuck at a constant.
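To see the flat-lining concretely, here is a small sketch (illustrative data, not the question's; it follows the shape of the official AdaBoostRegressor example) comparing in-range and out-of-range predictions:
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_train = np.sort(5 * rng.rand(80, 1), axis=0)              # inputs in [0, 5]
y_train = np.sin(X_train).ravel() + 0.1 * rng.randn(80)

regr = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                         n_estimators=300, random_state=0)
regr.fit(X_train, y_train)

X_in = np.linspace(0, 5, 5)[:, np.newaxis]     # inside the training range
X_out = np.linspace(6, 12, 5)[:, np.newaxis]   # outside the training range

print(regr.predict(X_in))    # roughly follows the sine shape
print(regr.predict(X_out))   # one repeated constant: the trees cannot extrapolate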

sklearn's GradientBoostingRegressor gives the same prediction for different inputs

I encountered weird behaviour while trying to train sklearn's GradientBoostingRegressor and make predictions. I will use an example to demonstrate the issue on a reduced dataset, but the issue remains on a larger dataset as well. I have the following two small datasets adapted from a big dataset. As you can see, the target variable is identical in both cases, but the input variables are different, though their values are close to each other. The target variable (Y) is in the last column.
I have the following code:
import pandas as pd
from sklearn import ensemble

d1 = {'0':[101869.2,102119.9,102138.0,101958.3,101903.7,12384900],
      '1':[101809.1,102031.3,102061.7,101930.0,101935.2,11930700],
      '2':[101978.0,102208.9,102209.8,101970.0,101878.6,12116700],
      '3':[101869.2,102119.9,102138.0,101958.3,101903.7,12301200],
      '4':[102125.5,102283.4,102194.0,101884.8,101806.0,10706100],
      '5':[102215.5,102351.9,102214.0,101769.3,101693.6,10116900]}
data1 = pd.DataFrame(d1).T
X1 = data1.loc[:, :4]
Y = data1[5]
d2 = {'0':[101876.0,102109.8,102127.6,101937.0,101868.4,12384900],
      '1':[101812.9,102021.2,102058.8,101912.9,101896.4,11930700],
      '2':[101982.5,102198.0,102195.4,101940.2,101842.5,12116700],
      '3':[101876.0,102109.8,102127.6,101937.0,101868.4,12301200],
      '4':[102111.3,102254.8,102182.8,101832.7,101719.7,10706100],
      '5':[102184.6,102320.2,102188.9,101699.9,101548.1,10116900]}
data2 = pd.DataFrame(d2).T
X2 = data2.loc[:, :4]
Y = data2[5]
re1 = ensemble.GradientBoostingRegressor(n_estimators=40, max_depth=None, random_state=1)
re1.fit(X1, Y)
pred1 = re1.predict(X1)
re2 = ensemble.GradientBoostingRegressor(n_estimators=40, max_depth=None, random_state=3)
re2.fit(X2, Y)
pred2 = re2.predict(X2)
where
X1 is a pandas DataFrame corresponding to Column 1 through Column 5 of the 1st dataset,
X2 is a pandas DataFrame corresponding to Column 1 through Column 5 of the 2nd dataset, and
Y represents the target column.
The issue I am facing is that I cannot explain why pred1 is exactly the same as pred2. As long as X1 and X2 are not the same, pred1 and pred2 must also be different, mustn't they? Please help me find my false assumption.
What you observe is perfectly expected.
You fit a high-complexity estimator to the data (max_depth=None), so it is easy for it to learn all of the data by heart, that is, to overfit completely on the training data.
The prediction will then simply be whatever labels you gave it for training.
Have a look at Peter's talk here about how to tune GradientBoosting correctly:
https://www.youtube.com/watch?v=-5l3g91NZfQ
Anyhow, you should at least have a held-out test set.
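A minimal sketch of that advice on toy data (the settings are illustrative, not the original poster's): restrict the tree depth and score on a held-out test set instead of on the training data, so memorized labels no longer look like good predictions:
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(500, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limited depth (the sklearn default is max_depth=3) instead of max_depth=None,
# and evaluation on held-out data rather than on the training set.
gbr = GradientBoostingRegressor(n_estimators=40, max_depth=3, random_state=1)
gbr.fit(X_train, y_train)

print("train R^2:", gbr.score(X_train, y_train))
print("test  R^2:", gbr.score(X_test, y_test))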
My guess is that since you are fitting X1 and X2 to the same Y, it is reasonable that pred1 and pred2 are similar. When your regressor is very powerful (it can fit anything to anything) or your problem is too easy (it can be fitted exactly by your regressor), then pred1 and pred2 will both simply be equal to Y.
