Training New AutoTokenizer Hugging Face - python

I'm getting this error: AttributeError: 'GPT2Tokenizer' object has no attribute 'train_new_from_iterator'
My code is very similar to the Hugging Face documentation; I only changed the input, which shouldn't affect anything. It worked once, but when I came back to it two hours later it no longer did, and nothing was changed. The documentation states that train_new_from_iterator only works with 'fast' tokenizers, and that AutoTokenizer is supposed to pick a 'fast' tokenizer by default, so my best guess is that something is going wrong there. I also tried downgrading transformers and reinstalling, with no success. df is just one column of text.
import pandas as pd
from transformers import AutoTokenizer
import tokenizers

def batch_iterator(batch_size=10, size=5000):
    for i in range(100):  # 2264
        query = f"select note_text from cmx_uat.note where id > {i * size} limit 50;"
        df = pd.read_sql(sql=query, con=cmx_uat)
        for x in range(0, size, batch_size):
            yield list(df['note_text'].loc[0:5000])[x:x + batch_size]

old_tokenizer = AutoTokenizer.from_pretrained('roberta')
training_corpus = batch_iterator()
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)

There are two things to keep in mind:
First: train_new_from_iterator works with fast tokenizers only (you can read more here).
Second: the training corpus should be "a generator of batches of texts, for instance, a list of lists of texts if you have everything in memory" (official docs).
import pandas as pd
from transformers import AutoTokenizer

def batch_iterator(batch_size=3, size=8):
    df = pd.DataFrame({"note_text": ['fghijk', 'wxyz']})
    for x in range(0, size, batch_size):
        yield df['note_text'].to_list()

old_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
training_corpus = batch_iterator()
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)

print(old_tokenizer(['fghijk', 'wxyz']))
print(new_tokenizer(['fghijk', 'wxyz']))
output:
{'input_ids': [[0, 506, 4147, 18474, 2], [0, 605, 32027, 329, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
{'input_ids': [[0, 22, 2], [0, 21, 2]], 'attention_mask': [[1, 1, 1], [1, 1, 1]]}
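If the AttributeError from the question appears, it usually means a slow (pure-Python) tokenizer was loaded rather than a fast one. A quick sanity check, as a sketch (it assumes the tokenizers package is installed so AutoTokenizer can actually return a fast tokenizer):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('roberta-base', use_fast=True)
print(tok.is_fast)                              # should print True
print(hasattr(tok, 'train_new_from_iterator'))  # should print True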

Interpreting results from linearmodels PanelOLS .predict() method

Suppose I have the following toy data:
import pandas as pd
from linearmodels.panel import PanelOLS

y = pd.DataFrame(
    index=[[1, 1, 1, 2, 2, 2], [1, 2, 3, 1, 2, 3]],
    data=[70, 60, 50, 30, 33, 27],
    columns=["y"],
)
y.index.set_names(["Entity", "Time"], inplace=True)

x = pd.DataFrame(
    index=[[1, 1, 1, 2, 2, 2], [1, 2, 3, 1, 2, 3]],
    data=[[100], [89], [62], [29], [49], [23]],
    columns=["X"],
)
x.index.set_names(["Entity", "Time"], inplace=True)
And build a model using PanelOLS with entity_effects=True:
model_within = PanelOLS(dependent=y, exog=x, entity_effects=True).fit()
I then wanted to use the predict() method to see how a new "entity" would be modelled, first creating a new entity with:
new_x = pd.DataFrame(
    index=[[3, 3, 3], [1, 2, 3]],
    data=[[40], [70], [33]],
    columns=["X"],
)
new_x.index.set_names(["Entity", "Time"], inplace=True)
Then predicting with:
model_within.predict(new_x)
To get the following output:
             predictions
Entity Time
3      1       16.136230
       2       28.238403
       3       13.312390
According to Wooldridge, 2012, pg 485, the within estimator is estimating the time-demeaned equation
y_it - ȳ_i = β(x_it - x̄_i) + (u_it - ū_i)
Since this is modelling a change in expected y from the entity's average y across time, how should the predictions be interpreted? My intuition is that the prediction is saying:
For this new entity, 3, in time period 1, given these X inputs, y at time 1 should be 16 units higher than its average y across all time for this entity. Is this interpretation correct? How might it be improved?
See also the linearmodels .predict() documentation.
Posting the result of seeking clarification through the repo:
https://github.com/bashtage/linearmodels/issues/465
"The model is always Y = XB + epsilon + (eta_t) + (nu_i). The effects are treated as errors, and so when you predict you get new_x @ params, and so the entity effects are not used."
So the predictions are for actual values of y, not time-demeaned predictions. However, to get time-demeaned predictions, one can fit the same model on data that has first been time-demeaned, and pass in new time-demeaned data to predict on.
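A minimal sketch of that workaround, assuming the y, x, and new_x frames defined above (entity_demean is a hypothetical helper, not part of linearmodels):
import pandas as pd
from linearmodels.panel import PanelOLS

def entity_demean(df):
    # subtract each entity's own mean (over time) from every observation
    return df - df.groupby(level="Entity").transform("mean")

# fit on entity-demeaned data; no entity_effects needed since the
# demeaning has already removed them
model_dm = PanelOLS(dependent=entity_demean(y), exog=entity_demean(x)).fit()

# predictions are now themselves deviations from the entity mean
print(model_dm.predict(entity_demean(new_x)))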

Why does this TensorFlow code behave differently when inside a test case?

I have a function (foo below) which is behaving differently when it's run directly vs when it is run inside a tf.test.TestCase.
The code is supposed to create a dataset with elems [1..5] and shuffle it. Then it repeats 3 times: create an iterator from the data and use that to print the 5 elements.
When run on its own it gives output where all the lists are shuffled e.g.:
[4, 0, 3, 2, 1]
[0, 2, 1, 3, 4]
[2, 3, 4, 0, 1]
but when run inside a test case they are always the same, even between runs:
[0, 4, 2, 3, 1]
[0, 4, 2, 3, 1]
[0, 4, 2, 3, 1]
I imagine it's something to do with how test cases handle random seeds but I can't see anything about that in the TensorFlow docs. Thanks for any help!
Code:
import tensorflow as tf

def foo():
    sess = tf.Session()
    dataset = tf.data.Dataset.range(5)
    dataset = dataset.shuffle(5, reshuffle_each_iteration=False)
    for _ in range(3):
        data_iter = dataset.make_one_shot_iterator()
        next_item = data_iter.get_next()
        with sess.as_default():
            data_new = [next_item.eval() for _ in range(5)]
        print(data_new)

class DatasetTest(tf.test.TestCase):
    def testDataset(self):
        foo()

if __name__ == '__main__':
    foo()
    tf.test.main()
I am running it with Python 3.6 and TensorFlow 1.4. No other modules should be needed.
I think you are right; tf.test.TestCase is set up to use a fixed seed:
class TensorFlowTestCase(googletest.TestCase):
    # ...
    def setUp(self):
        self._ClearCachedSession()
        random.seed(random_seed.DEFAULT_GRAPH_SEED)
        np.random.seed(random_seed.DEFAULT_GRAPH_SEED)
        ops.reset_default_graph()
        ops.get_default_graph().seed = random_seed.DEFAULT_GRAPH_SEED
and DEFAULT_GRAPH_SEED = 87654321; see this line in tensorflow/tensorflow/python/framework/random_seed.py.
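If you want the shuffles to differ inside the test anyway, one option (a sketch, not from the original answer) is to override the graph seed after setUp has pinned it, e.g. with a time-based seed:
import time
import tensorflow as tf

class DatasetTest(tf.test.TestCase):
    def testDataset(self):
        # setUp() has already fixed the graph seed, so replace it here
        tf.set_random_seed(int(time.time()))
        foo()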

InvalidArgumentError TensorFlow sparse_to_dense

I have a sparse tensor that I'm building up from a collection of indices and values. I'm trying to implement some code to take a full row slice. Although this functionality doesn't appear to be directly supported in TensorFlow, it seems there are workarounds that can return the indices and values for a specified row, as follows:
def sparse_slice(self, indices, values, needed_row_ids):
    needed_row_ids = tf.reshape(needed_row_ids, [1, -1])
    num_rows = tf.shape(indices)[0]
    partitions = tf.cast(
        tf.reduce_any(tf.equal(tf.reshape(indices[:, 0], [-1, 1]), needed_row_ids), 1),
        tf.int32)
    rows_to_gather = tf.dynamic_partition(tf.range(num_rows), partitions, 2)[1]
    slice_indices = tf.gather(indices, rows_to_gather)
    slice_values = tf.gather(values, rows_to_gather)
    return slice_indices, slice_values
Then called directly like so on a sparse 4x4 matrix where I am interested in accessing all of the elements in row 3:
with tf.Session().as_default():
    indices = tf.constant([[0, 0], [1, 0], [2, 0], [2, 1], [3, 0], [3, 3]])
    values = tf.constant([10, 19, 1, 1, 6, 5], dtype=tf.int64)
    needed_row_ids = tf.constant([3])
    slice_indices, slice_values = self.sparse_slice(indices, values, needed_row_ids)
    print('indicies: {} and rows: {}'.format(slice_indices.eval(), slice_values.eval()))
Which outputs the following:
indicies: [[3 0]
[3 3]] and rows: [6 5]
So far so good. I then figure I can use this information to construct a 1x4 dense tensor with the values at those indices and 0s for the missing columns.
dense_representation = tf.sparse_to_dense(sparse_values=slice_values,
                                          sparse_indices=slice_indices,
                                          output_shape=(1, 4))
However, the moment I run the tensor in a session:
sess = tf.Session()
sess.run(dense_representation)
I receive the following exception:
InvalidArgumentError (see above for traceback): indices[0] = [3,0] is out of bounds: need 0 <= index < [1,4]
[[Node: SparseToDense = SparseToDense[T=DT_INT64, Tindices=DT_INT32, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](Gather_2, SparseToDense/output_shape, Gather_3, SparseToDense/default_value)]]
I'm not quite sure what I'm doing wrong, or whether this has something to do with the output_shape not being properly formed. Essentially I'd like to stuff this all back into a 1x4 vector. I haven't been able to find any good examples online for how to do this. Any help would be appreciated.
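One likely cause, sketched here as an assumption rather than a confirmed fix: sparse_to_dense validates the indices against output_shape, and [3, 0] does not fit in a (1, 4) tensor, so the row coordinate has to be re-mapped to 0 before densifying:
# shift the selected row's indices so the row coordinate becomes 0
row_offset = tf.reduce_min(slice_indices[:, 0])
shifted_indices = tf.stack(
    [slice_indices[:, 0] - row_offset, slice_indices[:, 1]], axis=1)
dense_representation = tf.sparse_to_dense(sparse_indices=shifted_indices,
                                          output_shape=(1, 4),
                                          sparse_values=slice_values)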

How to get multi-label output after using tf.nn.sigmoid()

Purpose:
I want to get the label names directly from tensorflow-serving at prediction time. My question is how to convert pred = tf.nn.sigmoid(last_layer_output) into real label names.
Question Description:
I know how to do it for the multi-class case:
CLASSES = tf.constant(['a', 'b', 'c', 'd', 'e'])
pred = tf.nn.softmax(last_layer_output)
# pred is pretty similar to:
pred = [[0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1]]
classes_value = tf.argmax(pred, 1)
classes_name = tf.gather(CLASSES, classes_value)
# classes_name: [b'b' b'd' b'e']
# batch_size = 3
So classes_name is what I want; I can use it to design the signature for tensorflow-serving, and when I predict I get the final labels.
But how do I do the same thing in the multi-label case?
e.g.
CLASSES = tf.constant(['a', 'b', 'c', 'd', 'e'])
pred = tf.nn.sigmoid(last_layer_output)
pred = tf.round(pred)
# pred is pretty similar to:
pred = [[0, 1, 1, 0, 1],  # 'b', 'c', 'e'
        [1, 0, 0, 1, 0],  # 'a', 'd'
        [1, 1, 1, 0, 1]]  # 'a', 'b', 'c', 'e'
How can I convert pred into label names? I can't do it after sess.run() or with another API like numpy, because this is for tensorflow-serving; I think I need to do it with tf ops.
Any suggestions are appreciated!
You first need to decide, given the per-class probabilities, which classes to return, for example all classes whose probability is above 0.5.
Then you can use tf.where with that condition to get the indices, and the same tf.gather to get the classes.
Like this:
indices = tf.where(tf.greater(pred, 0.5))
classes = tf.gather(CLASSES, indices[:, 1])
Then you need to re-organize the result using indices[:, 0], which tells you which example each class came from.
Also, keep in mind that the natural form of the answer is a SparseTensor, which is not well supported by serving and the like. So you may want to return two tensors (the strings, plus indicators saying which strings belong to which example) and deal with it on the client side.
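A minimal, self-contained sketch of that idea (the probabilities and the 0.5 threshold are made up for illustration; sess.run is only used here to show the result, the ops themselves are plain graph ops usable in a serving signature):
import tensorflow as tf

CLASSES = tf.constant(['a', 'b', 'c', 'd', 'e'])
pred = tf.constant([[0.1, 0.9, 0.8, 0.2, 0.7],
                    [0.9, 0.1, 0.2, 0.8, 0.3]])

hits = tf.where(tf.greater(pred, 0.5))   # shape [num_hits, 2]: (example index, class index)
example_ids = hits[:, 0]                 # which example each label belongs to
label_names = tf.gather(CLASSES, hits[:, 1])

with tf.Session() as sess:
    names, ids = sess.run([label_names, example_ids])
    print(names)  # [b'b' b'c' b'e' b'a' b'd']
    print(ids)    # [0 0 0 1 1]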

How can I save a LibSVM python object instance?

I want to use this classifier on another computer without having to train it again.
I used to save some classifiers from scikit-learn with cPickle.
Doing the same with LIBSVM gives me: ValueError: ctypes objects containing pointers cannot be pickled.
I'm using LibSVM 3.1 and Python 2.7.3.
Thanks
from libsvm.svm import *
from libsvm.svmutil import *
import cPickle
x = [[1, 0, 1], [-1, 0, -1]]
y = [1, -1]
prob = svm_problem(y, x)
param = svm_parameter()
param.kernel_type = LINEAR
param.C = 10
m = svm_train(prob, param)
labels_pred, acc, probs = svm_predict([-1, 1], [[1, 1, 1], [0, 0, 1]], m)
print labels_pred, acc, probs
import ipdb; ipdb.set_trace()
filename='libsvm-classif.pkl'
fid = open(filename, 'wb')
cPickle.dump(m, fid)
fid.close()
fid = open(filename, 'rb')
m = cPickle.load(fid)
labels_pred, acc, probs = svm_predict([-1, 1], [[1, 1, 1], [0, 0, 1]], m)
print labels_pred, acc, probs
Just use libsvm's own save and load functions:
svm_save_model('libsvm.model', m)
m = svm_load_model('libsvm.model')
This is from the README file included in the python directory of the libsvm package. It seems to have a much better description of features than the website.
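For completeness, a sketch of how that replaces the cPickle round-trip in the question (m is the model returned by svm_train above; the filename is arbitrary):
from libsvm.svmutil import svm_save_model, svm_load_model, svm_predict

svm_save_model('libsvm-classif.model', m)

# ...later, possibly on another machine...
m = svm_load_model('libsvm-classif.model')
labels_pred, acc, probs = svm_predict([-1, 1], [[1, 1, 1], [0, 0, 1]], m)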
