one_hot_explicit parameter for h2o python raises error

When training a model in H2O v3.10 using the Python h2o library, I see an error when trying to set one_hot_explicit as the choice for the categorical_encoding parameter.
encoding = "enum"
gbm = H2OGradientBoostingEstimator(
    categorical_encoding = encoding)
gbm.train(x, y, train_h2o_df, test_h2o_df)
This works fine and the model uses enum categorical_encoding, but when:
encoding = "one_hot_explicit"
or
encoding = "OneHotExplicit"
the following error is raised:
gbm Model Build progress: | (failed)
....
OSError: Job with key $03017f00000132d4ffffffff$_bde8fcb4777df7e0be1199bf590a47f9 failed with an exception: java.lang.AssertionError
stacktrace:
java.lang.AssertionError
    at hex.ModelBuilder.init(ModelBuilder.java:958)
    at hex.tree.SharedTree.init(SharedTree.java:78)
    at hex.tree.gbm.GBM.init(GBM.java:57)
    at hex.tree.SharedTree$Driver.computeImpl(SharedTree.java:159)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:169)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1203)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Is there some dependency I'm missing or is this a bug?

Your encoding choice should work, though you may want to update to the latest stable release of H2O. Here is a code snippet that works; run it and check whether it works for you. If it does, you can then pinpoint the difference between your previous code and the example below.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed ("YES") or not ("NO")
# original data can be found at http://www.transtats.bts.gov/
airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
# convert columns to factors
airlines["Year"] = airlines["Year"].asfactor()
airlines["Month"] = airlines["Month"].asfactor()
airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
# set the predictor names and the response column name
predictors = ["Origin", "Dest", "Year", "DayOfWeek", "Month", "Distance"]
response = "IsDepDelayed"
# split into train and validation sets
train, valid = airlines.split_frame(ratios = [.8], seed = 1234)
# try using the `categorical_encoding` parameter:
encoding = "one_hot_explicit"
# initialize the estimator
airlines_gbm = H2OGradientBoostingEstimator(categorical_encoding = encoding, seed = 1234)
# then train the model
airlines_gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
# print the auc for the validation set
airlines_gbm.auc(valid=True)
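If you do decide to upgrade, it is worth checking which version you are actually running first, since a mismatched client package and cluster can cause odd failures. A minimal sketch (h2o.__version__ reports the client package version; h2o.init() prints the server version in its status table):
import h2o
# client-side package version
print(h2o.__version__)
# h2o.init() prints a status table that includes the cluster (server) version
h2o.init()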

Related

BERT model: What does the error message "TypeError: normalize() argument 2 must be str, not None" mean & how can it be solved?

I am trying to train a BERT model with manually categorised Twitter data using Python in RStudio (via the reticulate package). The following error messages are currently shown:
Error during wrapup: TypeError: normalize() argument 2 must be str, not None
And:
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
The problem probably arises because the code I used (see below) was mainly copied from a tutorial, while my dataset differs in some ways from the dataset used there. The instructions I used are from here
Within those instructions, the dataset train.csv from this page was used.
That dataset contains 45 variables, in which text data is analysed with different measures of toxicity (with values anywhere between 0 and 1).
In my case, however, I use a csv file with 8 variables (converted from .RData to .csv using the R function write.csv2()) and far fewer data points. Another difference is that my toxicity measure is dichotomous (with a value of either 0 or 1).
Following the instructions, I specified the model in the following way:
library(reticulate)
k_bert = import('keras_bert')
token_dict = k_bert$load_vocabulary(vocab_path)
tokenizer = k_bert$Tokenizer(token_dict)
# Defining model parameters and column names for training BERT
seq_length = 512L    # the sequence length
bch_size = 100       # the batch size (number of data points passed through the model at a time)
epochs = 1           # the number of epochs
learning_rate = 1e-4 # the learning rate
DATA_COLUMN = 'comment_text'
LABEL_COLUMN = 'target'
# Loading the BERT model into R; sequences are automatically padded to seq_len
model = k_bert$load_trained_model_from_checkpoint(
  config_path,
  checkpoint_path,
  training = T,
  trainable = T,
  seq_len = seq_length)
# Function for tokenizing text:
tokenize_fun = function(dataset) {
  c(indices, target, segments) %<-% list(list(), list(), list())
  for (i in 1:nrow(dataset)) {
    c(indices_tok, segments_tok) %<-% tokenizer$encode(dataset[[DATA_COLUMN]][i],
                                                       max_len = seq_length)
    indices = indices %>% append(list(as.matrix(indices_tok)))
    target = target %>% append(dataset[[LABEL_COLUMN]][i])
    segments = segments %>% append(list(as.matrix(segments_tok)))
  }
  return(list(indices, segments, target))
}
# Function for reading data:
dt_data = function(dir, rows_to_read) {
  data = data.table::fread(dir, nrows = rows_to_read)
  c(x_train, x_segment, y_train) %<-% tokenize_fun(data)
  return(list(x_train, x_segment, y_train))
}
# Reading the training data using the function just created:
c(x_train, x_segment, y_train) %<-%
  dt_data('C:/Users/admet/OneDrive/Dokumente/3) Universität Bern/Akademisches Jahr 2021-2022/Computational Social Science/Hate Speech Detection using BERT/recent_search_body_test1_df.csv', 100)

RuntimeError: Found dtype Long but expected Float when fine-tuning using Trainer API

I'm trying to fine-tune a BERT model for sentiment analysis (classifying text as positive/negative) with the Huggingface Trainer API. My dataset has two columns, Text and Sentiment; it looks like this:
Text                 Sentiment
This was good place  1
This was bad place   0
Here is my code:
from datasets import load_dataset
from datasets import load_dataset_builder
from datasets import Dataset
import datasets
import transformers
from transformers import TrainingArguments
from transformers import Trainer
dataset = load_dataset('csv', data_files='./train/test.csv', sep=';')
tokenizer = transformers.BertTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
model = transformers.BertForSequenceClassification.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1", num_labels=1)
def tokenize_function(examples):
    return tokenizer(examples["Text"], truncation=True, padding='max_length')
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.rename_column('Sentiment', 'label')
tokenized_datasets = tokenized_datasets.remove_columns('Text')
training_args = TrainingArguments("test_trainer")
trainer = Trainer(
    model=model, args=training_args, train_dataset=tokenized_datasets['train']
)
trainer.train()
Running this throws error:
Variable._execution_engine.run_backward(
RuntimeError: Found dtype Long but expected Float
The error may come from the dataset itself, but can I fix it somehow in my code? I searched the Internet, and this error has previously been solved by "converting tensors to float", but how would I do that with the Trainer API? Any advice is highly appreciated.
Some reference:
https://discuss.pytorch.org/t/run-backward-expected-dtype-float-but-got-dtype-long/61650/10
Most likely, the problem is with the loss function. This can be fixed by setting up the model correctly, mainly by specifying the correct loss to use. Refer to this code to see the logic for deciding the proper loss.
Your problem has binary labels and thus should be framed as a single-label classification problem. As written, however, the code you shared will be inferred as a regression problem, which explains the error: the loss expected float but found long for the target labels.
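For reference, the decision logic in transformers' sequence-classification head looks roughly like this (a paraphrased sketch of the library's internals, not runnable on its own):
# paraphrased from BertForSequenceClassification.forward:
if self.config.problem_type is None:
    if self.num_labels == 1:
        # regression -> MSELoss, which expects float targets
        self.config.problem_type = "regression"
    elif self.num_labels > 1 and labels.dtype in (torch.long, torch.int):
        # single-label classification -> CrossEntropyLoss
        self.config.problem_type = "single_label_classification"
    else:
        # multi-label classification -> BCEWithLogitsLoss
        self.config.problem_type = "multi_label_classification"
With num_labels=1 and no problem_type set, your model falls into the regression branch, hence the dtype complaint.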
You need to pass the correct problem type.
model = transformers.BertForSequenceClassification.from_pretrained(
    "TurkuNLP/bert-base-finnish-cased-v1",
    num_labels=1,
    problem_type="single_label_classification"
)
This will make use of BCE loss. BCE loss requires the targets to be floats, so you also have to cast the labels to float. You can do that with the datasets API; see this.
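For example, a minimal sketch using datasets' map (assuming the column is called 'label', matching the rename in your code):
# cast the integer labels to floats so they match what BCE expects
tokenized_datasets = tokenized_datasets.map(lambda ex: {"label": float(ex["label"])})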
The other way would be to use a multi-class classifier or CE loss. For that, just fixing num_labels should be fine.
model = transformers.BertForSequenceClassification.from_pretrained(
    "TurkuNLP/bert-base-finnish-cased-v1",
    num_labels=2,
)
Here I am assuming that you are trying to do single-label classification, that is, to predict a single result rather than multiple results.
But the loss function you are using (probably BCE) expects a vector as the label.
So either convert your labels to vectors, as people suggested in the comments, or replace the loss function with cross-entropy loss and set the num_labels parameter to 2 (or whatever fits). Both solutions will work.
If you want to train your model as a multi-label classifier, you can convert your labels to vectors using sklearn.preprocessing:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
import numpy as np
dataset = pd.read_csv("filename.csv", encoding="utf-8")
# integer-encode the labels first
enc_labels = LabelEncoder()
int_encoded = enc_labels.fit_transform(np.array(dataset["Sentiment"].to_list()))
# then one-hot encode the integer labels
onehot_encoder = OneHotEncoder(sparse = False)
int_encoded = int_encoded.reshape(len(int_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(int_encoded)
# write the one-hot vectors back into the dataframe
for index, cat in dataset.iterrows():
    dataset.at[index, 'Sentiment'] = onehot_encoded[index]
You could cast your data.
If you have it in pandas format, you can do:
df['column_name'] = df['column_name'].astype(float)
If you have it in HuggingFace datasets format, you should do something like this:
from datasets import load_dataset
dataset = load_dataset('glue', 'mrpc', split='train')
from datasets import Value, ClassLabel
new_features = dataset.features.copy()
new_features["idx"] = Value('int64')
new_features["label"] = ClassLabel(names=['negative', 'positive'])
dataset = dataset.cast(new_features)
Before:
dataset.features
{'idx': Value(dtype='int32', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], id=None),
'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None)}
After:
dataset.features
{'idx': Value(dtype='int64', id=None),
'label': ClassLabel(num_classes=2, names=['negative', 'positive'], id=None),
'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None)}
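Back on the original problem: datasets also has a cast_column shortcut for changing a single column. A minimal sketch (assuming your label column is named 'label' and holds plain integers):
from datasets import Value
# cast just the label column to float32 in one call
dataset = dataset.cast_column("label", Value("float32"))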

Confusion Matrix on H2O

Final Edit: this problem turned out to occur because the target array contained integers that were meant to represent categories, so H2O was doing a regression. Once I converted them into factors using .asfactor(), the confusion matrix method detailed in the answer below worked.
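For reference, that conversion is a one-liner (a minimal sketch, using the data_h frame and target name from the code below):
# convert the integer-coded target to a factor so H2O trains a classifier
# (which reports a confusion matrix) instead of a regressor
data_h[target] = data_h[target].asfactor()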
I am trying to run a confusion matrix on my Random Forest model (my_model), but the documentation has been less than helpful. From here, it says the command is h2o.confusionMatrix(my_model), but no such method exists in 3.0.
Here are the steps to fit the model:
from h2o.estimators.random_forest import H2ORandomForestEstimator
data_h = h2o.H2OFrame(data)
train, valid = data_h.split_frame(ratios=[.7], seed = 1234)
my_model = H2ORandomForestEstimator(model_id = "rf_h", ntrees = 400,
                                    max_depth = 30, nfolds = 8, seed = 25)
my_model.train(x = features, y = target, training_frame = train)
pred = my_model.predict(valid)
I have tried the following:
my_model.confusion_matrix()
AttributeError: type object 'H2ORandomForestEstimator' has no attribute 'confusion_matrix'
Gotten from this example.
I have attempted to use tab completion to find out what it might be and have tried:
h2o.model.confusion_matrix(my_model)
TypeError: 'module' object is not callable
and
h2o.model.ConfusionMatrix(my_model)
which outputs simply all the model diagnostics and then the error:
H2OTypeError: Argument `cm` should be a list, got H2ORandomForestEstimator
Finally,
h2o.model.ConfusionMatrix(pred)
Which gives the same error as above.
Not sure what to do here, how can I view the results of the confusion matrix of the model?
Edit: Added more code to the beginning of the question for Context
Please see the documentation for the full parameter list. For your convenience, here is the signature: confusion_matrix(metrics=None, thresholds=None, train=False, valid=False, xval=False).
Here is a working example of how to use the method:
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
h2o.init()
# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"
# split into train and validation sets
train, valid = cars.split_frame(ratios = [.8], seed = 1234)
# try using the binomial_double_trees (boolean parameter):
# Initialize and train a DRF
cars_drf = H2ORandomForestEstimator(binomial_double_trees = False, seed = 1234)
cars_drf.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
cars_drf.confusion_matrix()
# or specify the validation frame
cars_drf.confusion_matrix(valid=True)
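The metrics parameter from the signature above lets you pick which threshold-maximizing metric the matrix is computed at; a small sketch (assuming the model trained above):
# confusion matrix at the threshold that maximizes F1 on the validation frame
cars_drf.confusion_matrix(metrics="f1", valid=True)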

Could not find trained model in model_dir

I have been going through Google's Machine Learning Crash Course, and am at the "First steps with TensorFlow" section. I wanted to run the examples on my machine, and keep getting an error that says:
ValueError: Could not find trained model in model_dir: C:\Users\Username\AppData\Local\Temp\tmpowu7j37s.
The folder at the end is different every time I run the script. So it is creating a directory for model_dir, but then either puts nothing there, or puts my model there and deletes it by the time the predict() method is called.
If I define model_dir in the estimator.LinearRegressor init method and set the checkpoint_path of the predict() method to the same directory, it tells me access is denied no matter where I point it, in C:\ or C:\Users, etc.
I should also mention I am executing inside an Anaconda environment.
Any help is greatly appreciated!
import math
from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset
tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
# Load the dataset
california_housing_dataframe = pd.read_csv("california_housing_train.csv", sep=",")
# Randomize the data (to avoid ordering bias) and divide a column by 1000 to reach a learning rate we usually work with
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))
california_housing_dataframe["median_house_value"] /= 1000.0
print(california_housing_dataframe) # print top and bottom 5 rows (see max rows 10 above)
#examine data briefly
print(california_housing_dataframe.describe())
#________________________________________________________________________________________
# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]
# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]
# Define the label.
targets = california_housing_dataframe["median_house_value"]
#__________________________________________________________________________________________
# Use gradient descent as the optimizer for training the model.
my_optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
# Configure the linear regression model with our feature columns and optimizer.
# Set a learning rate of 0.0000001 for Gradient Descent.
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)
#______________________________________________________________________________________________
def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Trains a linear regression model of one feature.
    Args:
      features: pandas DataFrame of features
      targets: pandas DataFrame of targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for next data batch
    """
    # Convert pandas data into a dict of np arrays.
    features = {key: np.array(value) for key, value in dict(features).items()}
    # Construct a dataset, and configure batching/repeating
    ds = Dataset.from_tensor_slices((features, targets)) # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)
    # Shuffle the data, if specified
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)
    # Return the next batch of data
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
#_______________________________________________________________________________________________
_ = linear_regressor.train(
    input_fn = lambda: my_input_fn(my_feature, targets),
    steps=100
)
#__________________________________________________________________________________________________
print(linear_regressor.model_dir)
# Create an input function for predictions.
# Note: Since we're making just one prediction for each example, we don't
# need to repeat or shuffle the data here.
prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)
# Call predict() on the linear_regressor to make predictions.
predictions = linear_regressor.predict(input_fn = prediction_input_fn)
# Format predictions as a NumPy array, so we can calculate error metrics.
predictions = np.array([item['predictions'][0] for item in predictions])
Full traceback:
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\Username\AppData\Local\Temp\tmpowu7j37s
C:\Users\Username\AppData\Local\Temp\tmpowu7j37s
Traceback (most recent call last):
  File "fstf.py", line 104, in <module>
    predictions = np.array([item['predictions'][0] for item in predictions])
  File "fstf.py", line 104, in <listcomp>
    predictions = np.array([item['predictions'][0] for item in predictions])
  File "C:\Users\Username\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 471, in predict
    self._model_dir))
ValueError: Could not find trained model in model_dir: C:\Users\Username\AppData\Local\Temp\tmpowu7j37s.
Because you didn't specify a model_dir parameter for the LinearRegressor, your trained model is saved in the system temporary directory and deleted/cleaned by the system when your program completes.
So you should specify a model_dir parameter for LinearRegressor.
The __init__ signature of LinearRegressor is:
__init__(
    feature_columns,
    model_dir=None,
    weight_column_name=None,
    optimizer=None,
    gradient_clip_norm=None,
    enable_centered_bias=False,
    label_dimension=1,
    _joint_weights=False,
    config=None,
    feature_engineering_fn=None
)
You can read the doc here
In terms of your code, you should change this:
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)
to
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer,
    model_dir="./your_own_model_dir"
)
Your program will then run successfully. Good luck!
I also ran into this problem, and I solved it by adding this code:
linear_regressor.train(
    input_fn = lambda: my_input_fn(my_feature, targets),
    steps=100
)
because you missed Step 5: Train the Model.
You should set eval_steps to 1 (or smaller) and make eval_batch_size large enough to cover all of the evaluation data (or larger).
If evaluation runs over many steps, then because of the checkpoint lifecycle, only the last 5 .ckpt files are kept by default (you can customize this), and later evaluation batches can no longer find a checkpoint to evaluate against.
That raises the error:
ValueError: Could not find trained model in model_dir: {your_model_dir}.
more detail:
- https://www.tensorflow.org/api_docs/python/tf/estimator/RunConfig
- https://github.com/colinwke/wide_deep_demo
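If you need evaluation to survive more training steps, checkpoint retention can be raised through RunConfig; a minimal sketch (the model_dir value and the number 20 are illustrative, not prescriptive):
import tensorflow as tf
# keep up to 20 checkpoints instead of the default 5 so evaluation
# can still find one after several rounds of training
run_config = tf.estimator.RunConfig(
    model_dir="./your_own_model_dir",
    keep_checkpoint_max=20)
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer,
    config=run_config)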

Tensorflow error: "Tensor must be from the same graph as Tensor..."

I am trying to train a simple binary logistic regression classifier using Tensorflow (version 0.9.0) in a very similar way to the beginner's tutorial and am encountering the following error when fitting the model:
ValueError: Tensor("centered_bias_weight:0", shape=(1,), dtype=float32_ref) must be from the same graph as Tensor("linear_14/BiasAdd:0", shape=(?, 1), dtype=float32).
Here is my code:
import tempfile
import tensorflow as tf
import pandas as pd
# Customized training data parsing
train_data = read_train_data()
feature_names = get_feature_names(train_data)
labels = get_labels(train_data)
# Construct dataframe from training data features
x_train = pd.DataFrame(train_data, columns=feature_names)
x_train["label"] = labels
y_train = tf.constant(labels)
# Create SparseColumn for each feature (assume all feature values are integers and either 0 or 1)
feature_cols = [tf.contrib.layers.sparse_column_with_integerized_feature(f, 2) for f in feature_names]
# Create SparseTensor for each feature based on data
categorical_cols = {f: tf.SparseTensor(indices=[[i, 0] for i in range(x_train[f].size)],
                                       values=x_train[f].values,
                                       shape=[x_train[f].size, 1]) for f in feature_names}
# Initialize logistic regression model
model_dir = tempfile.mkdtemp()
model = tf.contrib.learn.LinearClassifier(feature_columns=feature_cols, model_dir=model_dir)
def eval_input_fun():
    return categorical_cols, y_train
# Fit the model - similarly to the tutorial
model.fit(input_fn=eval_input_fun, steps=200)
I feel like I'm missing something critical... maybe something that was assumed in the tutorial but wasn't explicitly mentioned?
Also, I get the following warning every time I call fit():
WARNING:tensorflow:create_partitioned_variables is deprecated. Use tf.get_variable with a partitioner set, or tf.get_partitioned_variable_list, instead.
When you execute model.fit, the LinearClassifier creates a separate tf.Graph based on the ops contained in your eval_input_fun function. But during the creation of this graph, LinearClassifier does not have access to the definitions of categorical_cols and y_train you saved globally; those tensors belong to the default graph you built them in, not to the new one.
Solution: move all the op definitions (and their dependencies) inside eval_input_fun.
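Applied to the code above, that means building the SparseTensors and the label constant inside the function; a sketch reusing the question's variable names:
def eval_input_fun():
    # everything created here lands in the same tf.Graph that
    # LinearClassifier builds internally during fit()
    categorical_cols = {f: tf.SparseTensor(indices=[[i, 0] for i in range(x_train[f].size)],
                                           values=x_train[f].values,
                                           shape=[x_train[f].size, 1]) for f in feature_names}
    y_train = tf.constant(labels)
    return categorical_cols, y_train
model.fit(input_fn=eval_input_fun, steps=200)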
