How to scale the input data for trained model? - python

I have a trained model that uses regression to predict house prices. It was trained on a standardized dataset (StandardScaler from sklearn). How do I scale my model's input (a single example) now in a different Python program? I can't use StandardScaler on the input, because all features would be reduced to 0 (MinMaxScaler doesn't work either; I also tried saving and loading the scaler from the training script, which didn't work). So, how can I scale my input so that the features won't be 0, allowing the model to predict the price correctly?

What you've described is a contradiction in terms. Scaling refers to a range of data; a single datum does not have a "range"; it's a point.
What you seem to be asking is how to scale the input data to fit the translation you made when you trained. The answer here is straightforward again: you have to use the same translation function you applied when you trained. Standard practice is to revert the model's ingestion (i.e. reverse that scaling function); if you didn't do that, and you didn't make any note of that function's coefficients, then you do not have the information needed to apply the same translation to future input -- in short, your trained model isn't particularly useful.
You could try to recover the coefficients by running the scaling function on the original data set, making sure to output the resulting function. Then you could apply that function to your input examples.
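To make the point above concrete, here is a minimal sketch of saving the fitted scaler in the training script and reusing it in the prediction script. The data, feature values, and file location are made up for illustration; the only assumption about your setup is that StandardScaler from scikit-learn was used and that joblib is available. Fitting a fresh scaler on one example makes each feature equal to its own mean, which is why everything comes out as 0; calling only transform() on the loaded scaler avoids that.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# --- training script: fit the scaler on the full training matrix and save it ---
rng = np.random.default_rng(0)
# Hypothetical features: area, rooms, year built
X_train = rng.normal(loc=[120.0, 3.0, 1990.0], scale=[40.0, 1.0, 15.0], size=(200, 3))
scaler = StandardScaler().fit(X_train)

scaler_path = os.path.join(tempfile.mkdtemp(), "scaler.joblib")  # hypothetical location
joblib.dump(scaler, scaler_path)

# --- prediction script: load the fitted scaler and call transform() only ---
loaded = joblib.load(scaler_path)
x_new = np.array([150.0, 4.0, 2005.0])               # one example, shape (3,)
x_scaled = loaded.transform(x_new.reshape(1, -1))    # scikit-learn expects 2-D input

# The same translation by hand, using the statistics stored on the fitted scaler
x_manual = (x_new - loaded.mean_) / loaded.scale_

# What goes wrong: fitting a new scaler on the single example gives all zeros
x_wrong = StandardScaler().fit_transform(x_new.reshape(1, -1))
```

If loading the scaler "didn't work" before, a common cause is calling fit() or fit_transform() again after loading, or passing a 1-D array instead of a (1, n_features) array.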

Related

LightGBM: train() vs update() vs refit()

I'm implementing LightGBM (Python) into a continuous learning pipeline. My goal is to train an initial model and update the model (e.g. every day) with newly available data.
Most examples load an already trained model and apply train() once again:
updated_model = lightgbm.train(params=last_model_params, train_set=new_data, init_model = last_model)
However, I'm wondering if this is actually the correct way to approach continuous learning within the LightGBM library, since the number of fitted trees (num_trees()) grows by n_estimators with every application of train(). As I understand it, a model update should take an initial model definition (under a given set of model parameters) and refine it without ever growing the number of trees or the size of the model definition.
I find the documentation regarding train(), update() and refit() not particularly helpful. What would be considered the right approach to implement continuous learning with LightGBM?
In lightgbm (the Python package for LightGBM), these entrypoints you've mentioned do have different purposes.
The main lightgbm model object is a Booster. A fitted Booster is produced by training on input data. Given an initial trained Booster...
Booster.refit() does not change the structure of an already-trained model. It just updates the leaf counts and leaf values based on the new data. It will not add any trees to the model.
Booster.update() will perform exactly 1 additional round of gradient boosting on an existing Booster. It will add at most 1 tree to the model.
train() with an init_model will perform gradient boosting for num_iterations additional rounds. It also allows for lots of other functionality, like custom callbacks (e.g. to change the learning rate from iteration-to-iteration) and early stopping (to stop adding trees if performance on a validation set fails to improve). It will add up to num_iterations trees to the model.
What would be considered the right approach to implement continuous learning with LightGBM?
There are trade-offs involved in this choice and no one of these is the globally "right" way to achieve the goal "modify an existing model based on newly-arrived data".
Booster.refit() is the only one of these approaches that meets your definition of "refine [the model] without ever growing the amount of trees/size of the model definition". But it could lead to drastic changes in the predictions produced by the model, especially if the batch of newly-arrived data is much smaller than the original training data, or if the distribution of the target is very different.
Booster.update() is the simplest interface for this, but a single iteration might not be enough to get most of the information from the newly-arrived data into the model. For example, if you're using fairly shallow trees (say, num_leaves=7) and a very small learning rate, even newly-arrived data that is very different from the original training data might not change the model's predictions by much.
train(init_model=previous_model) is the most flexible and powerful option, but it also introduces more parameters and choices. If you choose to use train(init_model=previous_model), pay attention to parameters num_iterations and learning_rate. Lower values of these parameters will decrease the impact of newly-arrived data on the trained model, higher values will allow a larger change to the model. Finding the right balance between those is a concern for your evaluation framework.

Getting confidence intervals from an Xgboost fitted model

I am trying to get the confidence intervals from an XGBoost saved model in a .tar.gz file that is created using python XGBoost library.
The problem is that the model has already been fitted, and I don't have the training data any more; I just have inference or serving data to predict on. All the examples that I found entail using training and test data to create either quantile regression models or bagged models, but I don't think I have the chance to do that.
Why your desired approach will not work
I assume we are talking about regression here. Given a regression model that you cannot modify, I think you will not be able to achieve your desired result using only the given model. The model was trained to calculate a continuous value that approximates some objective value (i.e., its true value) based on some given input. Nothing more.
Possible solution
The only workaround I can think of would be to train two more models. These models' training goal would be to predict the quality of the output of your given model. One would calculate the upper bound of a given (i.e., predefined by you at training time) confidence interval and the other one the lower bound. This would probably include a lot of feature engineering. One would probably like to find features that correlate with the prediction quality of the original model.
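One possible way to sketch the two-model idea is shown below, under loud assumptions: some labelled examples must be collected after the fact (the two extra models cannot be trained without targets), the frozen model is represented by a scikit-learn regressor as a stand-in for the saved XGBoost model, and the bounds are learned as quantiles of the frozen model's residuals. The data generator and the 5%/95% quantile levels are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data whose noise level depends on the first feature."""
    X = rng.uniform(0.0, 10.0, size=(n, 2))
    noise = rng.normal(scale=0.2 + 0.2 * X[:, 0])
    y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + noise
    return X, y

# Stand-in for the already-fitted model that cannot be modified
X_fit, y_fit = make_data(500)
frozen_model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)

# Assumption: labelled examples gathered later, used only for the two bound models
X_cal, y_cal = make_data(500)
residuals = y_cal - frozen_model.predict(X_cal)

# One model per bound, each predicting a quantile of the frozen model's error
lower_model = GradientBoostingRegressor(loss="quantile", alpha=0.05, random_state=0)
upper_model = GradientBoostingRegressor(loss="quantile", alpha=0.95, random_state=0)
lower_model.fit(X_cal, residuals)
upper_model.fit(X_cal, residuals)

# Serving time: point prediction from the frozen model plus the learned bounds
X_serve, y_serve = make_data(200)
point = frozen_model.predict(X_serve)
lower = point + lower_model.predict(X_serve)
upper = point + upper_model.predict(X_serve)

# Fraction of true values that fall inside the interval (y_serve is known only
# because this data is synthetic)
coverage = np.mean((y_serve >= lower) & (y_serve <= upper))
```

The coverage check at the end is only possible when true values are available, so in practice it would be run on a held-back labelled set rather than on serving data.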

Creating train,test data for Word2Vec model

I am trying to create a W2V model and then generate train and test data to be used for my model. My question is: how can I generate test data after I am done creating a W2V model with my train data?
Word2Vec is considered an 'unsupervised' algorithm, so at least during its training, it is not typical to hold back any 'test' data for later evaluation.
A Word2Vec model is usually then evaluated on how well it helps some other process, such as the analogy-solving highlighted by the original paper. In gensim, the method evaluate_word_analogies() can repeat that process. But note: word-vectors that perform best on word-analogies may not be best for other purposes, like classification or info-retrieval. It's always best to evaluate & tune your word-vectors in a repeatable way that's related to your actual underlying use.
(If you're using the Word2Vec model's outputs - word-vectors specific to your domain – as part of a larger system, where some steps should be evaluated with held-back data, the decision of whether to train the Word2Vec component on all data could go either way, depending on other considerations.)

How to structure and size Y-labels for multivariate sequence prediction using Keras LSTMs

I am working on a sequence prediction problem where my inputs are of size (numOfSamples, numOfTimeSteps, features) where each sample is independent, number of time steps is uniform for each sample (after pre-padding the length with 0's using keras.pad_sequences), and my number of features is 2. To summarize my question(s), I am wondering how to structure my Y-label dataset to feed the model and want to gain some insight on how to properly structure my model to output what I want.
My first feature is a categorical variable encoded to a unique int and my second is numerical. I want to be able to predict the next categorical variable as well as an associated feature2 value, and then use this to feed back into the network to predict a sequence until the EOS category is output.
This is a main source I've been referencing to try and understand how to create a generator for use with keras.fit_generator.
There is no confusion with how the mini-batch for "X" data is grabbed, but for the "Y" data, I am not sure about the proper format for what I am trying to do. Since I am trying to predict a category, I figured a one-hot vector representation of the t+1 timestep would be the proper way to encode the first feature, which I guess results in a 4-dimensional(?) numpy array. But I'm kinda lost with how to deal with the second numerical feature.
Now, this leads me to questions concerning architecture and how to structure a model to do what I am wanting. Does the following architecture make sense? I believe there is something missing that I am not understanding.
Proposed architecture (parameters loosely filled in, nothing set yet):
from keras.models import Sequential
from keras.layers import Activation, Dense, LSTM, Masking, TimeDistributed

model = Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_size)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
model.fit_generator(...)  # I'll figure this out
So, at the end, a softmax activation can predict the next categorical value for feature1. How do I also output a value for feature2 so that I can feed the new prediction for both features back as the next time-step? Do I need some sort of parallel architecture with two LSTMs that are combined somehow?
This is my first attempt at doing anything with neural networks or Keras, and I would not say I'm "great" at python, I can get by though. However, I feel I have a decent grasp at the fundamental theoretical concepts, but lack the practice.
This question is sorta open ended, with encouragement to pick apart my current strategy.
Once again, the overall goal is to predict both features (categorical, numeric) in order to predict "full sequences" from intermediate length sequences.
Ex. I train on these padded max-len sequences, but in production I want to use this to predict the remaining part of the currently unseen time-steps, which would be variable length.
Okay, so if I understand you properly (correct me if I'm wrong), you would like to predict the next features based on the current ones.
When it comes to the categorical variable, you are on point: your Dense layer should output a vector of N values, one per class, which softmax turns into class probabilities. (Dropping one category, e.g. pandas.get_dummies with drop_first=True, is useful for one-hot input features in linear models, but a softmax classification target and output need all N classes.)
In addition to that N-element vector for each sample, the network should output one more number for the numerical value.
Remember to output logits (no activation; don't use softmax at the end like you currently do). Afterwards the network output should be split: the first N elements (your categorical feature) are passed to a loss function able to handle logits (e.g. in TensorFlow it is tf.nn.softmax_cross_entropy_with_logits_v2, which applies a numerically stable softmax for you).
Now, the last, (N+1)-th element of the network output should be passed to a different loss, probably Mean Squared Error.
Based on loss value of those two losses (you could take a mean of both to obtain one loss value), you backpropagate through the network and it might do just fine.
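The combined loss described above can be sketched in plain NumPy, independent of any framework. This is only an illustration under stated assumptions: the network emits one logit per category followed by one numeric value, the function name combined_loss and the class_weight parameter are hypothetical, and a simple weighted mean is used to merge the two losses.

```python
import numpy as np

def combined_loss(outputs, y_class, y_value, class_weight=0.5):
    """Cross-entropy on the logits plus MSE on the last output column.

    outputs: (batch, N + 1) raw network outputs; the first N columns are
             class logits, the last column is the numeric prediction.
    y_class: (batch,) integer class labels in [0, N).
    y_value: (batch,) numeric targets.
    """
    logits, value_pred = outputs[:, :-1], outputs[:, -1]

    # Numerically stable log-softmax: subtract the row maximum first
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    cross_entropy = -log_probs[np.arange(len(y_class)), y_class].mean()
    mse = np.mean((value_pred - y_value) ** 2)

    total = class_weight * cross_entropy + (1.0 - class_weight) * mse
    return total, cross_entropy, mse

# Two samples, N = 4 categories, all-zero outputs (uniform class probabilities)
outputs = np.zeros((2, 5))
total, ce, mse = combined_loss(outputs, np.array([0, 3]), np.array([1.0, -1.0]))
```

With class_weight=0.5 this is the plain mean of the two losses; if the numeric target has a much larger scale than the cross-entropy, a different weight (or scaling the target) may be needed so one loss does not dominate.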
Unfortunately I'm not skilled enough in Keras in order to help you with the code, but I think you will figure it out yourself. While we're at it, I would like to suggest PyTorch for more custom neural networks (I think yours fits this description), though it's definitely doable in Keras as well, your choice.
Additional 'maybe helpful' thought: you may check Teacher Forcing for your kind of task. More on the topic and theory behind it can be found in the outstanding Deep Learning Book and code example (though in PyTorch once again), can be found in their docs here.
BTW interesting idea, mind if I use it in connection with my current research trajectory (with kudos going to you of course)? Comment on this answer if so we can talk it out in chat.
Basically every answer I was looking for was demonstrated and explained in this tutorial. It is an absolutely great resource for understanding how to model multi-output networks, and it goes through a lengthy walkthrough of a multi-output CNN architecture. It only took me about three weeks to stumble upon, however.
https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/
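For the LSTM case in the question, a multi-output model along those lines might look like the sketch below, using the Keras functional API with one head per feature. It is not taken from the tutorial; the layer sizes, the output names "category" and "value", the equal loss weights, and the random data are all assumptions. It also shows one way to build the Y labels: the targets are simply the inputs shifted by one time step.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, vocab_size, hidden_size = 10, 6, 32  # illustrative sizes

# Shared trunk: masking for the zero padding, then one LSTM over the sequence
inputs = keras.Input(shape=(timesteps, 2))
x = layers.Masking(mask_value=0.0)(inputs)
x = layers.LSTM(hidden_size, return_sequences=True)(x)

# Head 1: next category at every time step; head 2: next numeric value
category = layers.TimeDistributed(
    layers.Dense(vocab_size, activation="softmax"), name="category"
)(x)
value = layers.TimeDistributed(layers.Dense(1), name="value")(x)

model = keras.Model(inputs, [category, value])
model.compile(
    optimizer="adam",
    loss={"category": "sparse_categorical_crossentropy", "value": "mse"},
    loss_weights={"category": 1.0, "value": 1.0},
)

# Random stand-in sequences of length timesteps + 1: column 0 is the integer
# category (0 is reserved for padding), column 1 is the numeric feature
batch = np.random.rand(4, timesteps + 1, 2).astype("float32")
batch[..., 0] = np.random.randint(1, vocab_size, size=(4, timesteps + 1))

X = batch[:, :-1]                         # steps 0 .. T-1 as input
y_cat = batch[:, 1:, 0].astype("int32")   # next-step category, shape (4, timesteps)
y_val = batch[:, 1:, 1:2]                 # next-step value, shape (4, timesteps, 1)

cat_pred, val_pred = model.predict(X, verbose=0)
```

Training would then pass the targets keyed by output name, e.g. {"category": y_cat, "value": y_val}. With the sparse cross-entropy loss the category targets stay as integers, so no one-hot 4-D array is needed.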

Multi-output regression

I have been looking into multi-output regression for the last few weeks. I am working with the scikit-learn package. My machine learning problem has an input of 3 features and needs to predict two output variables. Some ML models in the sklearn package support multi-output regression natively. If a model does not support this, the sklearn multi-output regression wrapper can be used to convert it. The multioutput class fits one regressor per target.
Does the multioutput regressor class, or do the natively supported multi-output regression algorithms, take the underlying relationship of the input variables into account?
Instead of a multi-output regression algorithm should I use a Neural network?
1) For your first question, I have divided that into two parts.
First part has the answer written in the documentation you linked and also in this user guide topic, which states explicitly that:
As MultiOutputRegressor fits one regressor per target it can not take
advantage of correlations between targets.
The second part of the first question asks about other algorithms which support this. For that you can look at the "inherently multiclass" part of the user guide. Inherently multi-class means that they don't use a One-vs-Rest or One-vs-One strategy to handle multi-class problems (OvO and OvR use multiple models to fit multiple classes and so may not use the relationship between targets); instead, they can structure the multi-class setting into a single model. The user guide lists the following:
sklearn.naive_bayes.BernoulliNB
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.naive_bayes.GaussianNB
sklearn.neighbors.KNeighborsClassifier
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
sklearn.discriminant_analysis.LinearDiscriminantAnalysis
sklearn.svm.LinearSVC (setting multi_class="crammer_singer")
sklearn.linear_model.LogisticRegression (setting multi_class="multinomial")
...
...
...
Try replacing the 'Classifier' at the end with 'Regressor' and see the documentation of fit() method there. For example let's take DecisionTreeRegressor.fit():
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (real numbers).
Use dtype=np.float64 and order='C' for maximum efficiency.
You see that it supports a 2-d array for targets (y). So it may be able to use correlation and underlying relationship of targets.
2) Now for your second question about using neural network or not, it depends on personal preference, the type of problem, the amount and type of data you have, the training iterations you want to do. Maybe you can try multiple algorithms and choose what gives best output for your data and problem.
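To see the difference between the wrapper and a natively multi-output estimator, here is a small sketch with 3 input features and 2 targets, matching the shape of the problem in the question. The synthetic data and the choice of GradientBoostingRegressor as the wrapped single-output model are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                       # 3 input features
Y = np.column_stack([X[:, 0] + X[:, 1],             # 2 related targets
                     X[:, 0] - X[:, 2]])
Y += rng.normal(scale=0.1, size=Y.shape)

# Wrapper: fits one independent regressor per target column
wrapped = MultiOutputRegressor(GradientBoostingRegressor(random_state=0)).fit(X, Y)

# Native support: a single tree whose leaves hold a value for every target
native = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, Y)

print(len(wrapped.estimators_))   # number of separately fitted models
print(native.n_outputs_)          # number of targets handled by the single tree
print(wrapped.predict(X[:2]).shape, native.predict(X[:2]).shape)
```

Both return predictions of shape (n_samples, 2); the difference is that the wrapper holds separate models in estimators_, while the tree chooses each split using all targets at once.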
