Unexpected Number of Weights in tf.keras Sequential Model - python

I have a question about the predictive power of each feature, so I need a way to evaluate how strong each feature is in the final model. My feature_layer contains two indicator_columns wrapped around categorical_column_with_vocabulary_lists for categorical data, an indicator_column wrapped around a cross between two bucketized numerical columns for latitude/longitude data, and five numeric columns.
I would expect the finished model to have 15 weights: 2 for the latitude and longitude, 5 for the numeric columns, and 5 and 3 for the two categorical columns using one-hot encoding. However, len(model.get_weights()[0]) returns 513. I suspect the latitude and longitude account for many more weights, since a cross between two bucketized columns ends up being a sparse categorical feature with a fairly high resolution. However, assuming this is true, I still don't know how to interpret the weights returned by model.get_weights()[0].

I found out that the answer has to do with the hash_bucket_size argument of the crossed_column. Each of those hash buckets gets a weight of its own in the final model. The 513 weights are the 13 weights from every other feature plus the 500 hash buckets for the crossed latitude/longitude column.
In terms of interpreting the weights, I am under the assumption that the weights of the model remain in the order that I added features to the feature_layer.
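A minimal sketch of a feature_layer like the one described (the column names, vocabularies and bucket boundaries below are made up purely for illustration) shows where the 513 comes from: the indicator over the 500-bucket cross contributes 500 input weights, and the remaining columns contribute 5 + 3 + 5 = 13.

import tensorflow as tf

# Hypothetical reconstruction of the feature_layer described above.
lat = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('latitude'), boundaries=list(range(-90, 91, 10)))
lon = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('longitude'), boundaries=list(range(-180, 181, 10)))
# The cross is hashed into 500 buckets; each bucket gets its own weight.
lat_x_lon = tf.feature_column.indicator_column(
    tf.feature_column.crossed_column([lat, lon], hash_bucket_size=500))

cat_a = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list('cat_a', ['a', 'b', 'c', 'd', 'e']))
cat_b = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list('cat_b', ['x', 'y', 'z']))
numerics = [tf.feature_column.numeric_column(name) for name in ['n1', 'n2', 'n3', 'n4', 'n5']]

feature_layer = tf.keras.layers.DenseFeatures([cat_a, cat_b, lat_x_lon] + numerics)
model = tf.keras.Sequential([feature_layer, tf.keras.layers.Dense(1)])

# After building/fitting, model.get_weights()[0] has shape (513, 1):
# 5 + 3 + 500 + 5 = 513 input weights feeding the single output unit.
# (Worth checking whether DenseFeatures keeps the insertion order of the
# columns or sorts them by name before mapping weights back to features.)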

Related

Keras evaluate Accuracy w.r.t. a feature

I have a dataset that consists of different features, like "gender". The task of the model is to determine if the annual income is above or below 50k.
Let say I have a trained network that does the classification.
Now I want to see how often the classifier makes false positive and false negative predictions, respectively, by grouping them according to the gender feature.
The basic idea is a confusion matrix of sorts, but not class vs. class, rather class vs. feature.
The image below illustrates the result I would like to have.
The basic idea is as follows:
1) Make a prediction with the network.
2) Add the predicted values as a new column to your dataset; you now have a new dataset data_new.
Your dataset now has two columns, one for the predicted and one for the true values. You can calculate the overall accuracy by boolean comparison (1 and 1 is a correct prediction; 0 and 1, and 1 and 0, are wrong predictions).
3) Now you can filter the new data on any column you want, in my case the specific gender.
4) Now you can calculate the accuracy with respect to the chosen gender; see the sketch below.
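A minimal sketch of those steps with pandas (the column names gender and income_above_50k, and the objects df, X and model, are placeholders for your own data and trained network):

import pandas as pd

data_new = df.copy()
# Step 1 + 2: predict and attach the predictions as a new column.
data_new['pred'] = (model.predict(X) > 0.5).astype(int).ravel()
data_new['correct'] = data_new['pred'] == data_new['income_above_50k']
print('overall accuracy:', data_new['correct'].mean())

# Step 3 + 4: group by the feature of interest and count the error types.
for gender, group in data_new.groupby('gender'):
    fp = ((group['pred'] == 1) & (group['income_above_50k'] == 0)).sum()
    fn = ((group['pred'] == 0) & (group['income_above_50k'] == 1)).sum()
    print(gender, 'accuracy:', group['correct'].mean(), 'FP:', fp, 'FN:', fn)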

How to predict on test set in matrix factorization collaborative filtering

Say I have a df that looks like this:
   userId  movieId  rating
0       1       31       1
1       1       34       5
2       1      742       2
3       1     1013       4
4       2       31       1
...
I've split it using stratified sampling to keep the same users in both the train and test sets.
When training on the train dataset I would usually initialize embedding matrices for users and movies and learn them using SGD.
After the two matrices are learned, say P and Q, I take dot_product(P_i, Q.T_j) to get the prediction for the (i, j)-th position in the ratings matrix.
Since P and Q are learned embeddings, it seems correct to use them to predict the validation dataset. However, simply computing validation_dataset - dot_product(P, Q) doesn't make sense, because the shapes of the train and validation datasets are different.
One way to do this is to take known ratings out of the original dataset and keep them as a validation set. However, I am wondering if there is a way to split the data first and then apply the learned embeddings to predict the test set (this seems more intuitive to me, but I don't know how to do it...).
The most widely accepted method to calculate test-set performance on collaborative-filtering systems is to keep some number of known user-item interactions separate, in the form of a test set. We exclude those test-set interactions from the training set, which is used to train the model.
After training, for each pair of user u and item i in the test set, we compute the model's predicted interaction score for u and i, and compare it with the known interaction score, which is either 1 or 0 (0 when negative-sampling is used). This is how we compute the test-set performance metrics.
If you use the model's predicted ratings/scores to create new data-points for the test-set, then it may not reflect the true generalization performance of the model on completely unseen/new data. Let me know if that answers your question, or if any clarifications are needed.
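As a sketch of that per-pair evaluation (assuming P is the learned user-embedding matrix, Q the learned item-embedding matrix, and user_idx / item_idx are whatever id-to-row mappings you built when constructing the training matrix):

import numpy as np

def predict_pairs(test_df, P, Q, user_idx, item_idx):
    # Predict one rating per (user, movie) pair in the held-out test set.
    preds = [P[user_idx[u]].dot(Q[item_idx[m]])
             for u, m in zip(test_df['userId'], test_df['movieId'])]
    return np.array(preds)

preds = predict_pairs(test_df, P, Q, user_idx, item_idx)
rmse = np.sqrt(np.mean((preds - test_df['rating'].values) ** 2))

Note that this only works for users and movies that appear in the training split, which is why the stratified split that keeps every user in both sets is helpful.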

Using SHAP: Scaling the Shapley values for each model and then averaging across models, or just adding the Shapley values for each model?

I'm running n trials on a Keras model with k features, after which I apply SHAP's DeepExplainer to the model on each trial. All the data is the same, but it is randomly split between the training and testing sets. I'm trying to figure out the best way to combine the model outputs: either directly, by adding the Shapley values for each trial, feature by feature, and then averaging, or by scaling the Shapley values output by each trial first and then adding and averaging them.
My initial thought was that, since the "baseline is always relative based on the average of all predictions" (from here), the overall average would be skewed, and there might be a better way of combining the data. Still, I wonder if, despite the different samples in the train/test split and a different relative "baseline" for each model, averaging over many models would give a final averaged model with as much interpretive value as a single model. Should this be the case?
However, would scaling the features per model offer any benefits? Again from here, I can (save for the caveats) scale a feature's Shapley values for a single observation in a model. It seems then that I should be able to scale each feature's Shapley values after summing over all observations in each bin, such that all the Shapley values for each feature sum to 1. If I can scale by feature within a model this way, can I average the models this way as well? I'm thinking a benefit of this is that all models will then have equal weight, since the features are scaled within each. Is this a valid approach, and if so, does it offer any benefit over adding all the Shapley values, feature by feature, over all models?
To be clear about what I mean concerning the bins: they are the lists returned from the explainer, equal in number to the output classifications:
explainer = shap.DeepExplainer(model, X_train)
ShapleyBinVals = explainer.shap_values(X_test)
Bin = ShapleyBinVals[n]
where n indexes one of the output classifications. Here's a bar plot of the scaled output:
Notice that for each feature, e.g. PSWQ_2, the y-value is a percentage and the sum of the percentages over all bins is 1.
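To make the scaling concrete, here is roughly what I mean per trial (a simplified sketch; ShapleyBinVals is the list from the snippet above, and the normalization makes each feature's values sum to 1 across the bins, as in the bar plot):

import numpy as np

def scale_per_feature(ShapleyBinVals):
    # One array of shape (n_observations, n_features) per output bin.
    # Sum |Shapley value| over all observations: shape (n_bins, n_features).
    per_bin = np.array([np.abs(b).sum(axis=0) for b in ShapleyBinVals])
    # Normalize each feature (column) across the bins so it sums to 1.
    return per_bin / per_bin.sum(axis=0, keepdims=True)

# Averaging over n trials would then average these normalized matrices:
# avg = np.mean([scale_per_feature(v) for v in all_trial_shap_values], axis=0)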

Data missing imputation with autoencoder on small set of data

For my master's thesis I have to model an ANN to predict the level of consumer complaints from the in-process parameters of the production chain. Unfortunately, the firm gave me irregularly collected data with a lot of missing values. It's about a year of data grouped by open day, so I have 17 columns of physical values for 260 days. To infer the missing values, I tried to model a denoising autoencoder, but it doesn't give good results. For training the model, I have only 113 days with complete data. The values are real-valued, with different units and ranges (some are in the range (100, 150) and others in (90.03, 90.35)).
To simulate noise, and because the missing-data dynamic is Not Random At All, I modify a value with this condition (Random.random()
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers

def DAE(train, l1, l2, num_layer):
    input_size = train.shape[1]
    theta = 1  # int(input_size/num_layer)
    code_size = input_size - theta * (num_layer + 1)
    lrr = 0.01

    autoencoder = Sequential()
    # Input layer with L2 kernel / L1 activity regularization
    autoencoder.add(Dense(input_size, input_shape=(input_size,),
                          kernel_regularizer=regularizers.l2(0.01),
                          activity_regularizer=regularizers.l1(0.01)))
    # Encoder: each layer shrinks by theta units
    for index in range(num_layer):
        layer_size = input_size - (index + 1) * theta
        autoencoder.add(Dense(layer_size, activation='linear'))
        print(layer_size)
    # Bottleneck (code) layer
    autoencoder.add(Dense(code_size, activation='linear'))
    print(code_size)
    # Decoder: mirror the encoder back up to the input size
    for index in range(num_layer):
        layer_size = input_size - (num_layer - index) * theta
        autoencoder.add(Dense(layer_size, activation='linear'))
        print(layer_size)
    autoencoder.add(Dense(input_size, activation='linear'))
    autoencoder.compile(Adam(lr=lrr), loss='mean_squared_error', metrics=['accuracy'])
    return autoencoder

autoencoder = DAE(AE_train, l1, l2, 3)
history = autoencoder.fit(AE_train, AE_target, epochs=1000, validation_split=0.2)
On the train and test loss plot, it converges really fast, but after a certain number of epochs a big peak appears, followed by a log-like decay right after. I don't understand why it rises.
When I try to predict the missing values, I replace the NaNs with the mean of the column. The predictions are always outside the min/max range of the specific physical values.
So here are my questions: how can I deal with missing data in a small set of values? Here I have different types of values (units); should I normalize the values? But if I do that, how do I reconstruct them, since I want to infer real values? Is there a better solution for missing-data imputation than an autoencoder among ML techniques?
Thanks for reading my problem, and even more for bringing me an answer.
Loss plot for test and train sets
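For the normalization, what I have in mind would look something like the following sketch (using scikit-learn's MinMaxScaler; AE_train and AE_target as above, variable names illustrative), though I am not sure whether inverting the scaling like this is the right way to get back to real values:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                      # maps every column to [0, 1]
AE_train_scaled = scaler.fit_transform(AE_train)
AE_target_scaled = scaler.transform(AE_target)

# ... fit the autoencoder on the scaled arrays instead of the raw ones ...

# After predicting, map the outputs back to the original physical units:
# reconstructed = scaler.inverse_transform(autoencoder.predict(AE_test_scaled))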

How to use Cross Validation for Multioutput Regressor in Sci-kit?

First, my setup:
X is my feature table. It has 150 000 features and 96 samples. So 150 000 columns and 96 rows.
y is my target table. It has 4 labels and of course 96 samples. So 4x96 (columns x rows).
After splitting into train and test data I'm using MLPRegressor. Based on the scikit-learn documentation it is a native multioutput regressor, so I can use it to predict my four desired output values from a new sample of 150 000 features.
My code:
mlp = MLPRegressor(hidden_layer_sizes=(2000, 2000), solver= 'lbfgs', max_iter=100)
mlp.fit(X_train,y_train)
And then I'm using cross validation.
cross_validation.cross_val_score(mlp, X, y, scoring='r2')
The output is a list with 3 entries (parameter cv=3).
I don't really get how my 4 labels get represented by these 3 values.
I expected something in a format like this:
label 1: 3 entries, label 2: 3 entries and the same with label 3 and 4.
That way I would get the R^2 value for each of my labels three times, for different splits into test and train data.
Am I missing something? Do I need to use MultiOutputRegressor?
(See doc here)
And here is the documentation of cross validation.
Thanks.
First, if you are actually using cross_validation.cross_val_score(), you should replace it with model_selection.cross_val_score(); the cross_validation module has been deprecated and removed from recent versions of scikit-learn.
Now, the reason you are only getting a single score for all your outputs rather than individual entries is that this is how the default behaviour of the scorer is set.
You have used scoring='r2', which is documented here. There, you have the option to change the result when the input is multi-output (as in your case) by using the multioutput parameter:
multioutput : Defines aggregating of multiple output scores. Array-like value defines weights used to average scores. Default is 'uniform_average'.
'raw_values' : Returns a full set of scores in case of multioutput input.
'uniform_average' : Scores of all outputs are averaged with uniform weight.
'variance_weighted' : Scores of all outputs are averaged, weighted by the variances of each individual output.
You see that the default value is 'uniform_average', which just averages all the outputs to get a single value, which is what you are getting.
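cross_val_score expects the scorer to return a single number per fold, so if you want one R^2 per label per fold, one option is to run the folds yourself and call r2_score with multioutput='raw_values'. A sketch, assuming X and y are NumPy arrays as described above:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score
from sklearn.neural_network import MLPRegressor

per_label_scores = []   # one row of 4 R^2 values per fold
for train_idx, test_idx in KFold(n_splits=3).split(X):
    mlp = MLPRegressor(hidden_layer_sizes=(2000, 2000), solver='lbfgs', max_iter=100)
    mlp.fit(X[train_idx], y[train_idx])
    pred = mlp.predict(X[test_idx])
    per_label_scores.append(r2_score(y[test_idx], pred, multioutput='raw_values'))

per_label_scores = np.array(per_label_scores)   # shape (3 folds, 4 labels)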
