I am training a CNN model in Python and I have a question. I know that data normalization is important to scale the data in my dataframe (e.g. between 0 and 1), but let's say I perform z-score normalization on my dataframe vertically (i.e. within the scope of each feature). After I deploy the model and want to use it in real-world scenarios, I only have one row of data in my dataframe (but with the same number of features), so I am no longer able to perform normalization: the standard deviation of a single value is 0, and dividing by 0 in the z-score formula is not possible.
I want to confirm: do I still need to perform data normalization in real-world scenarios? If I do not need to, will the results differ because I did normalization during model training?
If you are using a StandardScaler from scikit-learn, you need to save the fitted scaler object and reuse it to transform new data after deployment.
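For example, a minimal sketch using joblib to persist the scaler fitted on the training data (the file name and the X_train / x_new variables are placeholders):

from sklearn.preprocessing import StandardScaler
import joblib

# At training time: fit the scaler on the full training set, then persist it.
scaler = StandardScaler().fit(X_train)        # X_train: your training features
X_train_scaled = scaler.transform(X_train)
joblib.dump(scaler, 'scaler.joblib')

# At deployment time: load the fitted scaler and transform the single new row
# with the stored training mean/std (no refitting, so one row is not a problem).
scaler = joblib.load('scaler.joblib')
x_new_scaled = scaler.transform(x_new)        # x_new: shape (1, n_features)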
I am training a neural network model for time series forecasting. I need to scale my data, as the model should be able to receive different time series with different value ranges as input. So far, I have only tried using a MinMaxScaler. I have a limited number of data points to fit my scaler on, as the individual time series are quite small, and I need the model to make predictions on most of the data. My time series are quite volatile, so as new data points are transformed by the scaler they often exceed the range of the scaler (above 1, below 0). This is a problem - specifically when values fall below 0. I know I can adjust the range of the MinMaxScaler, but that doesn't seem like a good solution.
Is it a good idea to scale the entire dataframe every time a new data point arrives? Or maybe just scale and transform the number of data points that the model uses to predict (the window size)?
If not, how do you solve the issue of having little data to fit the scaler on? Clipping values is not an option, as it loses a key component, namely the difference in value between the data points. I am not convinced that StandardScaler or RobustScaler will do the trick either, for the same reason.
I have a set of alphanumeric categorical features (c_1, c_2, ..., c_n) and one numeric target variable (prediction) as a pandas dataframe. Can you please suggest a feature selection algorithm that I can use for this data set?
I'm assuming you are solving a supervised learning problem like Regression or Classification.
First of all, I suggest transforming the categorical features into numeric ones using one-hot encoding. Pandas provides a useful function that already does it:
dataset = pd.get_dummies(dataset, columns=['feature-1', 'feature-2', ...])
If you have a limited number of features and a model that is not too computationally expensive, you can test every possible combination of features; it is the best approach, but it is seldom a viable option.
A possible alternative is to sort all the features by their correlation with the target, then sequentially add them to the model, measure the model's performance at each step, and select the set of features that provides the best performance.
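A rough sketch of that idea, assuming a one-hot encoded DataFrame X (0/1 columns), a numeric target Series y, and a simple linear model as the evaluator (all placeholder names):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Rank features by absolute correlation with the target.
ranking = X.corrwith(y).abs().sort_values(ascending=False).index

# Add features one at a time and keep the subset with the best cross-validated score.
best_score, best_features, selected = float('-inf'), [], []
for feature in ranking:
    selected.append(feature)
    score = cross_val_score(LinearRegression(), X[selected], y, cv=5).mean()
    if score > best_score:
        best_score, best_features = score, list(selected)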
If you have high-dimensional data, you can consider reducing the dimensionality using PCA or another dimensionality reduction technique. It projects the data into a lower-dimensional space, reducing the number of features; obviously, you will lose some information due to the PCA approximation.
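For instance, with scikit-learn's PCA (the 0.95 threshold is just an illustrative choice, and X is a placeholder for the numeric feature matrix):

from sklearn.decomposition import PCA

# Keep enough principal components to explain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)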
These are only some examples of methods to perform feature selection; there are many others.
Final tips:
Remember to split the data into training, validation and test sets.
Data normalization is often recommended to obtain better results.
Some models have embedded mechanisms to perform feature selection (Lasso, Decision Trees, ...); a short sketch follows below.
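As an illustration of that last tip, a small sketch with Lasso, whose L1 penalty drives some coefficients to exactly zero (the alpha value and the X_train / y_train names are placeholders):

import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # X_train, y_train: training data
kept = np.flatnonzero(lasso.coef_)               # indices of the features the model kept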
Say I have created a random forest regression model using the train/test data available to me.
This involves feature scaling and categorical data encoding.
Now, if I get a new dataset on a new day and I need to use this model to predict the outcome for this new dataset and compare the prediction with the actual outcome I receive with it, do I need to apply feature scaling and categorical data encoding to this dataset as well?
For example, on day 1 I have 10K rows with 6 features and 1 label -- a regression problem.
I built a model using this.
On day 2, I get 2K rows with the same features and label, but of course the data within it would be different.
Now I first want to use this model to predict what the labels for the day-2 data should be according to my model.
Secondly, using this result, I want to compare the model's output against the original day-2 labels that I have.
So, in order to do this, when I pass the day-2 features to the model as the test set, do I need to first apply feature scaling and categorical data encoding to them?
This is partly about making predictions and validating them against the labels received with the data, in order to assess the quality of the received data.
You always need to pass data to the model in the format it expects. If the model has been trained on scaled, encoded, ... data, you need to perform all these transformations every time you push new data into the trained model (for whatever reason).
The easiest solution is to use sklearn's Pipeline to create a pipeline with all those transformations included and then use it, instead of the model itself, to make predictions for new entries, so that all those transformations are applied automatically.
Example - automatically applying StandardScaler's scaling before passing data into the model:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scaling is fitted on the training data and applied automatically
# before every call to the SVC.
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# then (X_train, y_train, X_test, y_test, X_new are placeholder names for your data)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
pipe.predict(X_new)
The same holds for the dependent variable. If you scaled it before you trained your model, you will need to scale the new target values as well, or apply the inverse transformation to the model's output before you compare it with the original dependent variable values.
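A hedged sketch of that second option, assuming the target was scaled with its own StandardScaler (y_scaler, model and the data names below are placeholders):

from sklearn.preprocessing import StandardScaler

# Fit a separate scaler on the training target (a NumPy array here) and train on the scaled values.
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()

# At prediction time, map the model output back to the original units
# before comparing it with the day-2 labels.
y_pred = y_scaler.inverse_transform(model.predict(X_new).reshape(-1, 1)).ravel()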
Overview: I'm new to ML and learning sklearn preprocessing. I figured out that the mean will not be 0 and the std will not be 1 when we apply a sklearn preprocessing transform to TEST data (the reason being that we use the TRAIN data mean/std to standardize the test data).
My question: if the test data is standardized in this way (not standardized exactly to a Gaussian normal distribution with mean 0 and std 1), will this affect the prediction of the ML algorithm? My understanding is that the ML prediction will have low accuracy, as we are giving the ML model incorrectly standardized data.
[Code screenshot showing the mean and std of the transformed test data]
What this should be telling you is that your training and test sets might have different distributions. If your training set is not representative of the global population (here represented by the TEST data), then the model won't generalise that well.
It's completely OK if your test data isn't centred around zero with std 1. The point of this transform is to get all features into the same range, as otherwise a number of algorithms would update the model incorrectly (with respect to the user's intention). By applying this transform you are saying "all features are equally important".
There's no such thing as "incorrectly standardized data" (in the way you described), only training data that is not representative.
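As a quick sanity check, a small sketch with synthetic data (all values and names are illustrative) showing that a test set drawn from the same distribution ends up close to, but not exactly at, mean 0 and std 1 when scaled with the training statistics:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(1000, 1))
X_test = rng.normal(loc=5.0, scale=2.0, size=(200, 1))

scaler = StandardScaler().fit(X_train)       # statistics come from the TRAIN data only
X_test_scaled = scaler.transform(X_test)

print(X_test_scaled.mean(), X_test_scaled.std())   # near 0 and 1, but not exactly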
I have a trained model that uses regression to predict house prices. It was trained on a standardized dataset (StandardScaler from sklearn). How do I scale my model's input (a single example) now, in a different Python program? I can't fit a StandardScaler on the input, because all features would be reduced to 0 (MinMaxScaler doesn't work either; I also tried saving and loading the scaler from the training script, which didn't work). So, how can I scale my input so that the features won't be 0, allowing the model to predict the price correctly?
What you've described is a contradiction in terms. Scaling refers to a range of data; a single datum does not have a "range"; it's a point.
What you seem to be asking is how to scale the input data with the same transformation you applied when you trained. The answer here is straightforward again: you have to use the same transformation function you applied when you trained. Standard practice is to persist that scaling function from the model's ingestion step (so you can apply it again, and reverse it when needed); if you didn't do that, and you didn't make any note of that function's coefficients, then you do not have the information needed to apply the same transformation to future input -- in short, your trained model isn't particularly useful.
You could try to recover the coefficients by re-running the scaling function on the original data set, making sure to save the resulting fitted scaler. Then you could apply it to your input examples.
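A minimal sketch of that recovery step, assuming you still have the original training features (X_train, model and new_row are placeholder names):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Re-fit the scaler on the original training data to recover its mean/std,
# then transform (not fit) the single new example with those statistics.
scaler = StandardScaler().fit(X_train)
x_new = np.asarray(new_row, dtype=float).reshape(1, -1)   # one row, n_features columns
price = model.predict(scaler.transform(x_new))

# Persisting the fitted scaler (e.g. with joblib.dump) avoids this workaround next time.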