We have a dataset with ~40,000 data points, each having 160 features. We know nothing about what each feature represents, but they are integers from 0 to 5, most probably some kind of rankings. Our task is to take a subset of those features, let's say (40000, 30), and predict the initial (40000, 160) data. In other words, we need to create a model that takes 30 features as input and outputs the full set of 160 features.
An example of the dataset: https://i.stack.imgur.com/Ko6nR.png
What we have done so far: we trained an ANN with the following architecture:
30->200->150->163
We calculate an accuracy score by rounding the prediction (say I predicted 3.6 for a true value of 4; 3.6 rounds to 4, 4 == 4, so it counts as correct).
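For reference, this rounded-accuracy metric can be computed roughly as follows (a minimal NumPy sketch, assuming predictions and targets are already back on the original 0-5 scale):

```python
import numpy as np

def rounded_accuracy(y_true, y_pred):
    """Fraction of entries where the rounded prediction equals the target.

    Both arrays are assumed to have shape (n_samples, n_targets) on the 0-5 scale.
    """
    return float(np.mean(np.rint(y_pred) == y_true))

# Example: a prediction of 3.6 for a true value of 4 rounds to 4 and counts as correct.
print(rounded_accuracy(np.array([[4, 2]]), np.array([[3.6, 2.4]])))  # 1.0
```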
We got ~52% accuracy and nothing makes it go higher.
So, the problem is a multi-output regression problem. The prediction is done using 30 discrete numeric features. Normalization was done both with min-max scaling and with standardization (the target is also normalized). In the model, we tried different numbers of layers with different capacities, tried batch norm, different activations (ReLU is used now; no activation is used for the output layer), different losses (MSE is the current one), and different optimizers (Adam is the current one). We used both Keras and PyTorch, in case something was wrong with the PyTorch implementation.
Still, the accuracy remains at 50-52%. One thing is straightforward: increasing the model capacity (the number of parameters) should make the model more prone to overfitting, yet even after increasing the capacity a great deal, we couldn't make the model overfit the data. We also tried using the features separately (for example, predicting one feature from another), with nothing useful coming out of it. We tried to predict one feature from the other 159 features, but again got ~52% or even less.
What I understand and can conclude from this is that there is no relationship between those ratings, and most of them can't predict the others. What do you think about this case?
I'm trying to solve a regression problem using a Python Keras CNN (with TensorFlow as the backend), where I try to predict a single y-value based on an 8-channel satellite image (23x45 pixels) that I fetched from Google Earth Engine using their Python API. I currently have 280 images, which I augment to 2,500 images using flipping and random noise. The data is normalized and standardized, and I have removed outliers and images containing only zeros.
I've tested numerous CNN architectures, for example this one:
Conv2D(4, (4, 3)), MaxPooling2D((2, 2)), Dense(50), Dropout(0.4), Dense(30), Dropout(0.4), Dense(1)
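For reference, a runnable version of that architecture might look like the sketch below. This is my reconstruction, not the asker's exact code: I've assumed a channels-last input of shape (23, 45, 8), a Flatten layer between the pooling and dense layers, ReLU activations, and a linear output for regression.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Assumed input: 23x45 pixels, 8 channels (channels last).
model = models.Sequential([
    layers.Input(shape=(23, 45, 8)),
    layers.Conv2D(4, (4, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                 # assumed: needed before the dense layers
    layers.Dense(50, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(30, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(1),                  # linear output for single-value regression
])

model.compile(optimizer="adam", loss="mse")
model.summary()
```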
This results in weird behaviour where the predicted values fall mainly into two distinct groups or clusters (each group having very little variance), while the true values have much higher variance. See the image below.
I have chosen not to publish any code snippets, as my question is of a more general nature. What might lead to such clustered predictions? Are there any good common tricks to improve the results?
I've tried to solve the problem using a normal neural network and regression tools from SciKit-Learn, by flattening the images to one long array (length 23x45x8 = 8280). That doesn't result in clustering, although the accuracy is still quite low. I assume that is due to insufficient or inappropriate data.
Plot of truth (x) vs. prediction (y), showing that the predictions are heavily clustered.
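For reference, the flattening step described above amounts to something like this (a small NumPy/scikit-learn sketch on placeholder arrays; Ridge is just one example of the regression tools mentioned, not necessarily the one used):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Placeholder data: images of shape (n_samples, 23, 45, 8) and a single target per image.
images = np.random.rand(2500, 23, 45, 8)
y = np.random.rand(2500)

X_flat = images.reshape(len(images), -1)   # (n_samples, 23*45*8) = (n_samples, 8280)
model = Ridge(alpha=1.0).fit(X_flat, y)
print(X_flat.shape, model.score(X_flat, y))
```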
Your model is quite simple; it can't properly extract features, so I would guess it is underfitting. Also, your dropout rate of 40% in two layers is quite high for such a small network, and it looks like your output activation is linear.
The number of samples can also cause grouped predictions: mostly, the class with the majority of the samples gets chosen.
I have also noticed that some of your truth values are greater than 1 or less than 0. You have to normalize properly and use an appropriate output activation function.
I am performing sentiment analysis on a dataset of movie reviews. The neural network is a single-hidden-layer NN, made from scratch in Python. The classifier is expected to assign one of five classes (0 to 4) to each review phrase. However, upon training, the confusion matrix for the dev set gives the following results:
This means that the classifier is heavily biased towards class 0 and class 4. What could be the possible reasons?
The classifier earlier always predicted class 2, because the dataset was skewed (~50% of the data was from class 2), so I chose a subset of the dataset containing an equal number of examples from all 5 classes. I still don't understand the output and the low accuracy.
The link to my notebook can be found here
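For reference, a class-balanced subset like the one described above can be drawn roughly like this (a minimal pandas sketch; the DataFrame and the column name `label` are placeholders, not the actual notebook's names):

```python
import pandas as pd

# Placeholder data: phrases with labels 0-4, deliberately imbalanced in practice.
df = pd.DataFrame({
    "phrase": [f"review {i}" for i in range(1000)],
    "label": [i % 5 for i in range(1000)],
})

# Take the same number of examples from every class (the size of the smallest class).
n_per_class = df["label"].value_counts().min()
balanced = df.groupby("label").sample(n=n_per_class, random_state=0).reset_index(drop=True)
print(balanced["label"].value_counts())
```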
First of all, your model is linear, with only one layer, so it is a simple model that might not produce good results; try increasing the number of layers.
Your training cost is also very high; you have to train for more epochs until you get a good training cost. This also affects your validation cost, which is twice the training cost.
That is a sign of overfitting.
I am reading through the example of calibrating probabilities in the H2O documentation: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/calibrate_model.html
Since the example is poorly explained, my questions are:
Do we have to have weights (weights_column) in the training set?
If so, what do these weights do?
I also tried leaving out the weights. The code still runs, but the results are very different. Any insight would be appreciated.
If you are interested in how the weights column in H2O-3 works you can review the documentation here and code examples here.
I have included a copy of this documentation here for your convenience:
weights_column
Available in: GBM, DRF, Deep Learning, GLM, AutoML, XGBoost, CoxPH
Hyperparameter: no
Description
This option specifies the column in a training frame to be used when determining weights. Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are also supported. During training, rows with higher weights matter more, due to the larger loss function pre-factor.
For scoring, all computed metrics will take the observation weights into account (for Gains/Lift, AUC, confusion matrices, logloss, etc.), so it’s important to also provide the weights column for validation or test sets if you want to up/down-weight certain observations (ideally consistently between training and testing). Note that if you omit the weights, then all observations will have equal weights (set to 1) for the computation of the metrics. For example, a weight of 2 is identical to duplicating a row.
Notes:
Weights can be specified as integers or as non-integers.
The weights column cannot be the same as the fold_column.
If a weights column is specified as both a feature (predictor) and a weight, the column will be used for weights only.
Example unit test scripts are available on GitHub:
https://github.com/h2oai/h2o-3/blob/master/h2o-py/tests/testdir_algos/gbm/pyunit_weights_gbm.py
https://github.com/h2oai/h2o-3/blob/master/h2o-py/tests/testdir_algos/gbm/pyunit_weights_gamma_gbm.py
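For illustration, passing a weights column to a GBM through the H2O Python API looks roughly like this (a minimal sketch with a made-up toy frame; the column names and values are placeholders, not taken from the linked tests):

```python
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()

# Toy frame: two predictors, a binary response, and a per-row weight column.
train = h2o.H2OFrame({
    "x1":     [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "x2":     [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "y":      [0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "weight": [1, 2, 1, 1, 3, 1, 1, 1, 2, 1, 1, 1],  # a weight of 2 behaves like duplicating the row
})
train["y"] = train["y"].asfactor()

model = H2OGradientBoostingEstimator(ntrees=5, min_rows=2)
model.train(x=["x1", "x2"], y="y",
            training_frame=train,
            weights_column="weight")   # the weights column is not used as a predictor

print(model.auc(train=True))
```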
Observation Weights in Deep Learning
The observation weights are handled differently in Deep Learning than in the other supported algorithms. For algorithms other than Deep Learning, the weight goes into the split-finding and leaf-node prediction math in a straightforward way. For Deep Learning, it’s more difficult. Using the weight as a multiplicative factor in the loss will not work in general, and that would not be the same as replicating the same row. Also, applying the same row over and over isn’t a good idea either, so sampling during training should still be active. To address these issues, Deep Learning is implemented with importance sampling using the inverse cumulative distribution function. It also includes a special case that picks a random row from the dataset for every second row it trains, just to keep outliers in the game. Note that observation weights for Deep Learning that are neither 0 nor 1 are difficult to handle properly. In this case, it might be better to explicitly oversample using balance_classes=TRUE.
Related Parameters
balance_classes
offset_column
y
I am training a neural network model, and it fits the training data well: the training loss decreases steadily and everything works fine. However, when I output the weights of my model, I found that they haven't changed much since random initialization (I didn't use any pretrained weights; all weights are initialized by PyTorch's defaults). Every dimension of the weights changed only by about 1%, while the accuracy on the training data climbed from 50% to 90%.
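For reference, the kind of comparison described above can be done roughly like this (a minimal PyTorch sketch with a toy model standing in for the actual network):

```python
import copy
import torch
import torch.nn as nn

# Toy model standing in for the actual network.
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 2))

# Snapshot the freshly initialized weights before training.
init_state = copy.deepcopy(model.state_dict())

# ... training would happen here ...

# Relative change of each parameter tensor versus its initialization.
for name, param in model.named_parameters():
    change = (param.detach() - init_state[name]).norm() / init_state[name].norm()
    print(f"{name}: relative change {change.item():.4%}")
```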
What could account for this phenomenon? Is the dimensionality of the weights too high, so that I need to reduce the size of my model? Or is there some other possible explanation?
I understand this is quite a broad question, but I think it's impractical for me to show my model and analyze it mathematically here, so I just want to know what the general/common causes of this behaviour could be.
There are almost always many locally optimal points in a problem, so one thing you can't say, especially in high-dimensional feature spaces, is which optimum your model parameters will settle into. One important point here is that for every set of weights your model computes to reach an optimum, there are (because the weights are real-valued) infinitely many other sets of weights for that same optimum; the proportion of the weights to each other is what matters, because you are trying to minimize the cost, not to find a unique set of weights with zero loss for every sample. Every time you train, you may get a different result depending on the initial weights.

When the weights change very little, and by almost the same ratio relative to each other, it suggests your features are highly correlated (i.e. redundant). And since you are getting very high accuracy with only a small change in the weights, the only thing I can think of is that your dataset's classes are far apart from each other. Try removing features one at a time, retrain, and check the results; if the accuracy is still good, continue removing features until you (hopefully) reach a 2- or 3-dimensional space in which you can plot your data, visualize how the points are distributed, and make some sense of it.
EDIT: A better approach is to use PCA for dimensionality reduction instead of removing features one by one.
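For illustration, the PCA route could look roughly like this (a scikit-learn sketch on placeholder data, projecting down to 2 components for plotting):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data: X of shape (n_samples, n_features), y with class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = rng.integers(0, 2, size=500)

# Standardize, then project onto the top 2 principal components.
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```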
I am building a predictive model where I want to predict whether a package will be delivered on time (binary yes/no) and, in the event that it is not delivered on time, predict when it will be delivered, in categories of <7 days, <14 days, <21 days, and >28 days after the expected date.
I have built and tested a model for the binary classification and got an F-score of 0.92, which is satisfactory for my needs. However, when I train my categorical model, I start to see the training accuracy and validation accuracy diverge (the training accuracy is much better than the validation accuracy), which is a sign of overfitting.
However, I have tried regularization with different strengths, plus dropout with different rates, and the validation accuracy never gets above 0.7. My total training set is ~10k examples, with ~3k for validation, and while the spread across categories is not equal, there are sufficient examples of each category (I think). I am using a NN and have increased/decreased both the layers and the activations, and still no joy.
Any thoughts on where to go next? Thanks.
Because you are using a NN, introduce dropout layers and see if that helps reduce the overfitting. Also check out How to choose the number of hidden layers and nodes in a feedforward neural network?
The more complex the network (more hidden layers, more neurons in them), the more it contributes to the overfitting problem.
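The original framework and layer sizes weren't shared, so purely as an illustration, combining dropout with L2 regularization in a small Keras classifier could look like this (input size and layer widths are made up; the four output classes correspond to the lateness buckets in the question):

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

num_features = 20   # placeholder for the actual number of input features
num_classes = 4     # the lateness buckets described in the question

# A small network combining L2 regularization and dropout, as suggested above.
model = models.Sequential([
    layers.Input(shape=(num_features,)),
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(1e-3)),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu", kernel_regularizer=regularizers.l2(1e-3)),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```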
The approach we have chosen is to carry out a linear regression with the expected delivery duration as the target variable. We excluded some outliers and then took the differences between the actual and predicted days. We then took the max and min of those differences, and we now have a prediction with a tolerable range. We will keep working on the other techniques to see if we can improve. Thanks to everyone who suggested ideas.
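For illustration, the residual-range idea described above can be sketched roughly like this (scikit-learn, on placeholder data; `X` and `y_days` stand in for the real features and actual delivery durations):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data: features X and actual delivery duration in days.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))
y_days = 5 + X @ rng.normal(size=8) + rng.normal(scale=2, size=10_000)

reg = LinearRegression().fit(X, y_days)
pred_days = reg.predict(X)

# Differences between actual and predicted days, min'd and max'd to get a range.
residuals = y_days - pred_days
lo, hi = residuals.min(), residuals.max()

# A prediction then becomes an interval rather than a point estimate.
new_pred = reg.predict(X[:1])[0]
print(f"predicted {new_pred:.1f} days, tolerance range "
      f"[{new_pred + lo:.1f}, {new_pred + hi:.1f}] days")
```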