There are plenty of good tutorials on building a TensorFlow model, and I managed to create a model with good accuracy. However, two questions remain.
My dataset contains many classes, which I'll illustrate like this:
label - text
--------------------
A - this is a A text
B - this is a B text
C - this is a C text
...
Z - this is a Z text
...
ZA - this is a ZA text
...
Now I want to build a network that learns to classify the texts. I understand that I have to provide a fixed set of labels, because the net needs a fixed number of output neurons. So, for learning purposes, I started by building a network for only the 3 classes A, B and C. I fed the network only the corresponding rows (A, B, C) and got a model that can recognize A, B and C with good accuracy.
Now I want to predict new texts and would like to get an output like this:
input text -> predicted label
----------------------------
this is a B text -> B // successful prediction
this is a xyz text -> ? // cannot be predicted, because not learned
How do I achieve the "not predictable" for the not yet learned classes?
All in all, my way of producing a CSV file with an added prediction column might be a little clumsy. Could you show me how to do it better?
import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.read_parquet(path)
#print(df)
#label = df['kategorie'].fillna("N/A")
text = df['text'].fillna("")
text_padded = tokenize_and_pad(text)

# Predictions
probability_model = tf.keras.Sequential([model,
                                         tf.keras.layers.Softmax()])
predictions = probability_model.predict(text_padded)

# get the predicted labels
# I only achieved this with this loop - there must be a more elegant way???
predictedLabels = []
for prediction in predictions:
    labelID = np.argmax(prediction)
    predictedLabel = label_encoder.inverse_transform([labelID])
    predictedLabels.append(predictedLabel)

# add the new column to the dataframe
# the prediction is accurate for the learned labels
# but totally wrong for the labels that I excluded from the learning
df['predictedLabels'] = predictedLabels
# todo: write to file
From your question, I understand that you need help in two areas:
Answer to the question "How do I achieve the 'not predictable' for the not yet learned classes?":
a. Since you want to consider only 3 classes, instead of deleting the rows corresponding to the other classes, you can replace the labels of those rows with "Not Predictable", i.e., replace 'D', 'E', 'F', etc. with "Not Predictable".
b. In the final Dense layer, change the number of neurons from 3 to 4, with the 4th class representing "Not Predictable".
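A rough sketch of both steps, assuming the label column is called 'kategorie' as in your snippet (the column name and the exact layer definition are assumptions about your setup):

# a. relabel every row whose class is not A, B or C as "Not Predictable"
known = ['A', 'B', 'C']
df['kategorie'] = df['kategorie'].where(df['kategorie'].isin(known), 'Not Predictable')

# b. the final layer of the model then needs 4 output neurons instead of 3, e.g.
#    tf.keras.layers.Dense(4)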
Answer to the question "How to write predictions to a CSV file":
Now that the predictions are added as a column of the DataFrame df, you can write it to a CSV file using the command
df.to_csv('My_Predictions.csv')
For more information about this command, please refer to the pandas documentation for to_csv.
The way you are accessing the labels looks elegant.
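That said, if you ever want to drop the explicit loop, a vectorized version is possible, since np.argmax accepts an axis argument and LabelEncoder.inverse_transform accepts an array of indices (a sketch reusing your variable names):

labelIDs = np.argmax(predictions, axis=1)                          # one argmax per prediction row
df['predictedLabels'] = label_encoder.inverse_transform(labelIDs)  # decode all labels at once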
Please let me know if you run into any other errors and I will be happy to help.
Hope this helps. Happy learning!
I'm using customized text with 'Prompt' and 'Completion' to train a new model.
Here's the tutorial I used to create customized model from my data:
beta.openai.com/docs/guides/fine-tuning/advanced-usage
However, even after training the model and sending prompt text to it, I'm still getting generic results which are not always suitable for me.
How can I make sure completion results for my prompts will come only from the text I used for the model and not from the generic OpenAI models?
Can I use some flags to eliminate results from generic models?
Wrong goal: OpenAI API should answer from the fine-tuning dataset if the prompt is similar to the one from the fine-tuning dataset
This is completely the wrong logic. Forget about fine-tuning. As stated on the official OpenAI website:
Fine-tuning lets you get more out of the models available through the
API by providing:
Higher quality results than prompt design
Ability to train on more examples than can fit in a prompt
Token savings due to shorter prompts
Lower latency requests
Fine-tuning is not about answering with a specific answer from the fine-tuning dataset.
Fine-tuning helps the model gain more knowledge, but it has nothing to do with how the model answers. Why? The answer we get from the fine-tuned model is based on all knowledge (i.e., fine-tuned model knowledge = default knowledge + fine-tuning knowledge).
Although GPT-3 models have a lot of general knowledge, sometimes we want the model to answer with a specific answer (i.e., "fact").
Correct goal: Answer with a "fact" when asked about a "fact", otherwise answer with the OpenAI API
Note: For better (visual) understanding, the following code was run and tested in Jupyter.
STEP 1: Create a .csv file with "facts"
To keep things simple, let's add two companies (i.e., ABC and XYZ) with content. The content in our case will be a one-sentence description of the company.
companies.csv
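The file itself isn't reproduced here; a minimal illustrative version could look like this (the descriptions are made up, only the content column matters for the later steps):

company,content
ABC,ABC is a company specialized in software development for dentists.
XYZ,XYZ is a company specialized in the sale of organic coffee beans.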
Run print_dataframe.ipynb to print the dataframe.
print_dataframe.ipynb
import pandas as pd
df = pd.read_csv('companies.csv')
df
We should get the following output:
STEP 2: Calculate an embedding vector for every "fact"
An embedding is a vector of numbers that helps us understand how semantically similar or different two texts are. The closer two embeddings are to each other, the more similar their contents are (source).
Let's test the Embeddings endpoint first. Run get_embedding.ipynb with an input This is a test.
Note: For the Embeddings endpoint, the parameter prompt is called input.
get_embedding.ipynb
import openai

openai.api_key = '<OPENAI_API_KEY>'

def get_embedding(model: str, text: str) -> list[float]:
    result = openai.Embedding.create(
        model = model,
        input = text
    )
    return result['data'][0]['embedding']

print(get_embedding('text-embedding-ada-002', 'This is a test'))
We should get the following output:
What we see in the screenshot above is the embedding vector for This is a test. More precisely, we get a 1536-dimensional embedding vector (i.e., there are 1536 numbers inside). You are probably familiar with a 3-dimensional space (i.e., X, Y, Z). Well, this is a 1536-dimensional space, which is very hard to imagine.
There are two things we need to understand at this point:
Why do we need to transform text into an embedding vector (i.e., numbers)? Because later on we can compare embedding vectors and figure out how similar two texts are; we can't compare the texts as such (see the short sketch after the next point).
Why are there exactly 1536 numbers inside the embedding vector? Because the text-embedding-ada-002 model has an output dimension of 1536. It's pre-defined.
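As a quick illustration of what "comparing embedding vectors" means, cosine similarity is essentially a normalized dot product; here is a plain NumPy sketch with tiny made-up 3-dimensional vectors instead of real 1536-dimensional ones:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([0.1, 0.9, 0.0], [0.1, 0.8, 0.1]))  # close to 1: similar texts
print(cosine_similarity([0.1, 0.9, 0.0], [0.9, 0.0, 0.1]))  # close to 0: dissimilar texts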
Now we can create an embedding vector for each "fact". Run get_all_embeddings.ipynb.
get_all_embeddings.ipynb
import openai
from openai.embeddings_utils import get_embedding
import pandas as pd
openai.api_key = '<OPENAI_API_KEY>'
df = pd.read_csv('companies.csv')
df['embedding'] = df['content'].apply(lambda x: get_embedding(x, engine = 'text-embedding-ada-002'))
df.to_csv('companies_embeddings.csv')
The code above will take the first company (i.e., x), get its 'content' (i.e., "fact") and apply the function get_embedding using the text-embedding-ada-002 model. It will save the embedding vector of the first company in a new column named 'embedding'. Then it will take the second company, the third company, the fourth company, etc. At the end, the code will automatically generate a new .csv file named companies_embeddings.csv.
Saving embedding vectors locally (i.e., in a .csv file) means we don't have to call the OpenAI API every time we need them. We calculate an embedding vector for a given "fact" once and that's it.
Run print_dataframe_embeddings.ipynb to print the dataframe with the new column named 'embedding'.
print_dataframe_embeddings.ipynb
import pandas as pd
import numpy as np
df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df
We should get the following output:
STEP 3: Calculate an embedding vector for the input and compare it with embedding vectors from the companies_embeddings.csv using cosine similarity
We need to calculate an embedding vector for the input so that we can compare the input with a given "fact" and see how similar these two texts are. Actually, we compare the embedding vector of the input with the embedding vector of the "fact". Then we compare the input with the second "fact", the third "fact", the fourth "fact", etc. Run get_cosine_similarity.ipynb.
get_cosine_similarity.ipynb
import openai
from openai.embeddings_utils import cosine_similarity
import pandas as pd
import numpy as np

openai.api_key = '<OPENAI_API_KEY>'

my_model = 'text-embedding-ada-002'
my_input = '<INSERT_INPUT>'

def get_embedding(model: str, text: str) -> list[float]:
    result = openai.Embedding.create(
        model = model,
        input = text
    )
    return result['data'][0]['embedding']

input_embedding_vector = get_embedding(my_model, my_input)

df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, input_embedding_vector))
df
The code above will take the input and compare it with the first fact. It will save the calculated similarity of the two in a new column named 'similarity'. Then it will take the second fact, the third fact, the fourth fact, etc.
If my_input = 'Tell me something about company ABC':
If my_input = 'Tell me something about company XYZ':
If my_input = 'Tell me something about company Apple':
We can see that when we give Tell me something about company ABC as an input, it's most similar to the first "fact". When we give Tell me something about company XYZ as an input, it's most similar to the second "fact". Whereas, if we give Tell me something about company Apple as an input, it isn't particularly similar to either of the two "facts".
STEP 4: Answer with the most similar "fact" if similarity is above our threshold, otherwise answer with the OpenAI API
Let's set our similarity threshold to >= 0.9. The code below should answer with the most similar "fact" if similarity is >= 0.9, otherwise answer with the OpenAI API. Run get_answer.ipynb.
get_answer.ipynb
# Imports
import openai
from openai.embeddings_utils import cosine_similarity
import pandas as pd
import numpy as np

# Insert your API key
openai.api_key = '<OPENAI_API_KEY>'

# Insert OpenAI text embedding model and input
my_model = 'text-embedding-ada-002'
my_input = '<INSERT_INPUT>'

# Calculate the embedding vector for the input using the OpenAI Embeddings endpoint
def get_embedding(model: str, text: str) -> list[float]:
    result = openai.Embedding.create(
        model = model,
        input = text
    )
    return result['data'][0]['embedding']

# Save the embedding vector of the input
input_embedding_vector = get_embedding(my_model, my_input)

# Calculate the similarity between the input and the "facts" from the companies_embeddings.csv file we created before
df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, input_embedding_vector))

# Find the highest similarity value in the dataframe column 'similarity'
highest_similarity = df['similarity'].max()

# If the highest similarity value is equal to or higher than 0.9, print the 'content' with the highest similarity
if highest_similarity >= 0.9:
    fact_with_highest_similarity = df.loc[df['similarity'] == highest_similarity, 'content']
    print(fact_with_highest_similarity)
# Else pass the input to the OpenAI Completions endpoint
else:
    response = openai.Completion.create(
        model = 'text-davinci-003',
        prompt = my_input,
        max_tokens = 30,
        temperature = 0
    )
    content = response['choices'][0]['text'].replace('\n', '')
    print(content)
If my_input = 'Tell me something about company ABC' and the threshold is >= 0.9 we should get the following answer from the companies_embeddings.csv:
If my_input = 'Tell me something about company XYZ' and the threshold is >= 0.9 we should get the following answer from the companies_embeddings.csv:
If my_input = 'Tell me something about company Apple' and the threshold is >= 0.9 we should get the following answer from the OpenAI API:
How can I calculate Krippendorff Alpha for multi-label annotations?
In the case of multi-class annotation (assuming that 3 coders have annotated 4 texts with 3 labels: a, b, c), I first construct the reliability data matrix, then the coincidence matrix, and from the coincidences I can calculate alpha:
The question is how I can prepare the coincidences and calculate alpha in the case of a multi-label classification problem like the following one?
A Python implementation or even Excel would be appreciated.
I came across your question while looking for similar information. We used the code below, with nltk.agreement for the metrics and pandas_ods_reader to read the data from a LibreOffice spreadsheet. Our data has two annotators, and some items can have two labels (for instance, one coder annotated only one label and the other coder annotated two labels instead).
The spreadsheet screencap below shows the structure of the input data. The column for annotation items is called annotItems, and the annotation columns are called coder1 and coder2. The separator when there is more than one label is a pipe, unlike the comma in your example.
The code is inspired by this SO post: Low alpha for NLTK agreement using MASI distance
[Spreadsheet screencap]
from nltk import agreement
from nltk.metrics.distance import masi_distance
from nltk.metrics.distance import jaccard_distance
import pandas_ods_reader as pdreader

annotfile = "test-iaa-so.ods"
df = pdreader.read_ods(annotfile, "Sheet1")

annots = []

def create_annot(an):
    """
    Create frozensets with the unique label
    or with both labels splitting on pipe.
    Unique label has to go in a list so that
    frozenset does not split it into characters.
    """
    if "|" in str(an):
        an = frozenset(an.split("|"))
    else:
        # single label has to go in a list
        # need to cast or not depends on your data
        an = frozenset([str(int(an))])
    return an

for idx, row in df.iterrows():
    annot_id = row.annotItem + str.zfill(str(idx), 3)
    annot_coder1 = ['coder1', annot_id, create_annot(row.coder1)]
    annot_coder2 = ['coder2', annot_id, create_annot(row.coder2)]
    annots.append(annot_coder1)
    annots.append(annot_coder2)

# based on https://stackoverflow.com/questions/45741934/
jaccard_task = agreement.AnnotationTask(distance=jaccard_distance)
masi_task = agreement.AnnotationTask(distance=masi_distance)
tasks = [jaccard_task, masi_task]

for task in tasks:
    task.load_array(annots)
    print("Statistics for dataset using {}".format(task.distance))
    print("C: {}\nI: {}\nK: {}".format(task.C, task.I, task.K))
    print("Pi: {}".format(task.pi()))
    print("Kappa: {}".format(task.kappa()))
    print("Multi-Kappa: {}".format(task.multi_kappa()))
    print("Alpha: {}".format(task.alpha()))
For the data in the screencap linked from this answer, this would print:
Statistics for dataset using <function jaccard_distance at 0x7fa1464b6050>
C: {'coder1', 'coder2'}
I: {'item3002', 'item1000', 'item6005', 'item5004', 'item2001', 'item4003'}
K: {frozenset({'1'}), frozenset({'0'}), frozenset({'0', '1'})}
Pi: 0.1818181818181818
Kappa: 0.35714285714285715
Multi-Kappa: 0.35714285714285715
Alpha: 0.02941176470588236
Statistics for dataset using <function masi_distance at 0x7fa1464b60e0>
C: {'coder1', 'coder2'}
I: {'item3002', 'item1000', 'item6005', 'item5004', 'item2001', 'item4003'}
K: {frozenset({'1'}), frozenset({'0'}), frozenset({'0', '1'})}
Pi: 0.09181818181818181
Kappa: 0.2864285714285714
Multi-Kappa: 0.2864285714285714
Alpha: 0.017962466487935425
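Since your question mentions three coders: the annots list extends directly to more annotators, because AnnotationTask only needs (coder, item, labelset) triples. A sketch, assuming a hypothetical third column named coder3 in the spreadsheet:

for idx, row in df.iterrows():
    annot_id = row.annotItem + str.zfill(str(idx), 3)
    annots.append(['coder1', annot_id, create_annot(row.coder1)])
    annots.append(['coder2', annot_id, create_annot(row.coder2)])
    annots.append(['coder3', annot_id, create_annot(row.coder3)])  # hypothetical third annotator

The pi/kappa/alpha calls then stay exactly the same.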
This is my first question here.
I've been wanting to create a dataset from the popular IMDb dataset for learning purposes. The directories are as follows: .../train/pos/ and .../train/neg/. I created a function which should merge the text files with their labels, but I'm getting an error. I need your help to debug!
def datasetcreate(filepath, label):
    filepaths = tf.data.Dataset.list_files(filepath)
    return tf.stack([tf.data.Dataset.from_tensor_slices((_, tf.constant(label, dtype='int32'))) for _ in tf.data.TextLineDataset(filepaths)])
datasetcreate(['aclImdb/train/pos/*.txt'],1)
And this is the error I'm getting:
ValueError: Value tf.Tensor(b'An American in Paris was, in many ways, the ultimate.....dancers of all time.', shape=(), dtype=string) has insufficient rank for batching.
Why does this happen and what can I do to get rid of this? Thanks.
Your code has two problems:
First, because of the way you load your TextLineDatasets, your loaded tensors contain string objects, which have an empty shape, i.e. a rank of zero. (The rank of a tensor is the length of its shape property.)
Secondly, you are trying to stack two tensors with different ranks, which throws an error because a sentence (a sequence of tokens) has a rank of 1, while the label, as a scalar, has a rank of 0.
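A quick way to see the rank mismatch (just a sketch):

import tensorflow as tf

line = tf.constant('An American in Paris was great')  # what TextLineDataset yields per line
print(line.shape)                                     # () -> rank 0, a scalar string
print(tf.strings.split(line).shape)                   # (6,) -> rank 1, a sequence of tokens
print(tf.constant(1, dtype=tf.int32).shape)           # () -> rank 0, a scalar label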
If you just need the dataset, I recommend using the TensorFlow Datasets package, which has many ready-to-use datasets available.
If you want to solve your particular problem, one way to fix your data pipeline is by using the Dataset.interleave and Dataset.zip functions.
import tensorflow as tf

# load positive sentences
filepaths = list(tf.data.Dataset.list_files('aclImdb/train/pos/*.txt'))
sentences_ds = tf.data.Dataset.from_tensor_slices(filepaths)
sentences_ds = sentences_ds.interleave(lambda text_file: tf.data.TextLineDataset(text_file))
sentences_ds = sentences_ds.map(lambda text: tf.strings.split(text))

# dataset for labels, create 1 label per file
labels = tf.constant(1, dtype="int32", shape=(len(filepaths),))
label_ds = tf.data.Dataset.from_tensor_slices(labels)

# combine text with label datasets
dataset = tf.data.Dataset.zip((sentences_ds, label_ds))
print(list(dataset.as_numpy_iterator()))
First, you use the interleave function to combine multiple text datasets to one dataset. Next, you use tf.strings.split to split each text to its tokens. Then, you create a dataset for your positive labels. Finally, you combine the two datasets using zip.
IMPORTANT: To train/run any DL model on your dataset, you will likely need further pre-processing of your sentences, e.g. building a vocabulary and training word embeddings.
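For the vocabulary step, one option is a TextVectorization layer adapted on the raw review texts (a sketch, assuming TF 2.6+ where the layer lives under tf.keras.layers; the example texts are stand-ins for your review lines):

import tensorflow as tf

# stand-ins for the raw review lines loaded above
raw_texts = tf.data.Dataset.from_tensor_slices(
    ['An American in Paris was great', 'A rather dull film'])

vectorizer = tf.keras.layers.TextVectorization(max_tokens=10000,
                                               output_sequence_length=200)
vectorizer.adapt(raw_texts.batch(32))          # build the vocabulary
ids_ds = raw_texts.batch(32).map(vectorizer)   # each review -> padded sequence of token ids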
I am just a beginner in this subject. I have tested some NNs for image recognition, as well as NLP for sequence classification.
This second topic is interesting to me.
Using
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'some test sentence',
    'and the second sentence'
]
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sentences = tokenizer.texts_to_sequences(sentences)
will result in an array of size [n, 1], where n is the number of words in a sentence. And, assuming I have implemented padding correctly, each training example in the set will be of size [n, 1], where n is the max sentence length.
That prepared training set I can pass into Keras model.fit.
But what about when I have multiple features in my dataset?
let's say I would like to build an event prioritization algorithm and my data structure would look like:
[event_description, event_category, event_location, label]
Trying to tokenize such an array would result in an [n, m] matrix, where n is the maximum word length and m is the number of features.
How do I prepare such a dataset so that a model can be trained on it?
Would this approach be OK:
# Go through the training set to get all features into specific arrays
for data in dataset:
    training_sentence.append(data['event_description'])
    training_category.append(data['event_category'])
    training_location.append(data['event_location'])
    training_labels.append(data['label'])

# Tokenize each array so it contains tokenized values
tokenizer.fit_on_texts(training_sentence)
tokenizer.fit_on_texts(training_category)
tokenizer.fit_on_texts(training_location)

sequences = tokenizer.texts_to_sequences(training_sentence)
categories = tokenizer.texts_to_sequences(training_category)
locations = tokenizer.texts_to_sequences(training_location)

# Concatenate the feature arrays into one
training_example = numpy.concatenate([sequences, categories, locations])

# omitting model definition, training the model
model.fit(training_example, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
I haven't tested it yet. I just want to make sure I understand everything correctly and that my assumptions are correct.
Is this a correct approach to doing NLP with a NN?
I know of two common ways to manage multiple input sequences, and your approach lands somewhere between them.
One approach is to design a multi-input model with each of your text columns as a different input. They can share the vocabulary and/or embedding layer later, but for now you still need a distinct input sub-model for each of description, category, etc.
Each of these becomes an input to the network, using the Model(inputs=[...], outputs=rest_of_nn) syntax. You will need to design rest_of_nn so it can take multiple inputs. This can be as simple as your current concatenation, or you could use additional layers to do the synthesis.
It could look something like this:
# Imports (assuming the Keras functional API)
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Flatten, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.text import Tokenizer

# Build separate vocabularies. This could be shared.
desc_tokenizer = Tokenizer()
desc_tokenizer.fit_on_texts(training_sentence)
desc_vocab_size = len(desc_tokenizer.word_index)
categ_tokenizer = Tokenizer()
categ_tokenizer.fit_on_texts(training_category)
categ_vocab_size = len(categ_tokenizer.word_index)

# Inputs.
desc = Input(shape=(desc_maxlen,))
categ = Input(shape=(categ_maxlen,))

# Input encodings, opting for different embeddings.
# Descriptions go through an LSTM as a demo of extra processing.
embedded_desc = Embedding(desc_vocab_size, desc_embed_size, input_length=desc_maxlen)(desc)
encoded_desc = LSTM(categ_embed_size, return_sequences=True)(embedded_desc)
encoded_categ = Embedding(categ_vocab_size, categ_embed_size, input_length=categ_maxlen)(categ)

# Rest of the NN, which knows how to put everything together to get an output.
merged = concatenate([encoded_desc, encoded_categ], axis=1)
rest_of_nn = Dense(hidden_size, activation='relu')(merged)
rest_of_nn = Flatten()(rest_of_nn)
rest_of_nn = Dense(output_size, activation='softmax')(rest_of_nn)

# Create the model, assuming some sort of classification problem.
model = Model(inputs=[desc, categ], outputs=rest_of_nn)
model.compile(optimizer='adam', loss=K.categorical_crossentropy)
The second approach is to concatenate all of your data before encoding it, and then treat everything as a more standard single-sequence problem after that. It is common to use a unique token to separate or define the different fields, similar to BOS and EOS for the beginning and end of the sequence.
It would look something like this:
XXBOS XXDESC This event will be fun. XXCATEG leisure XXLOC Seattle, WA XXEOS
You can also do end tags for the fields like DESCXX, omit the BOS and EOS tokens, and generally mix and match however you want. You can even use this to combine some of your input sequences, but then use a multi-input model as above to merge the rest.
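Building those combined strings from the three fields in your question is a small preprocessing step; here is a sketch (the tag names are arbitrary):

def combine_fields(desc, categ, loc):
    return f"xxbos xxdesc {desc} xxcateg {categ} xxloc {loc} xxeos"

training_text = [combine_fields(d, c, l) for d, c, l
                 in zip(training_sentence, training_category, training_location)]
# training_text can now be tokenized and padded like any single text column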
Speaking of mixing and matching, you also have the option to treat some of your inputs directly as an embedding. Low-cardinality fields like category and location do not need to be tokenized, and can be embedded directly without any need to split into tokens. That is, they don't need to be a sequence.
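For those low-cardinality fields, "embed directly" could look roughly like this (reusing the imports from the block above; num_categories and the integer encoding of the category are assumptions, e.g. via sklearn's LabelEncoder):

categ_id = Input(shape=(1,))                               # one integer id per example
embedded_categ = Embedding(num_categories, categ_embed_size)(categ_id)
embedded_categ = Flatten()(embedded_categ)                 # shape (batch, categ_embed_size)
# ...then concatenate with the other branches as before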
If you are looking for a reference, I enjoyed this paper on Large Scale Product Categorization using Structured and Unstructured Attributes. It tests all or most of the ideas I have just outlined, on real data at scale.
I am trying to understand why the output from sklearn prediction is different when put inside a function.
I have a naive Bayes classifier trained on text, and when I make my predictions like this
import numpy as np

examples = ['my favorite sport is probably baseball']
predictions = vec_clf.predict(examples)[0]
probs = vec_clf.predict_proba(examples)
m = np.max(probs)
print predictions, m
I get the right prediction result. However if I write a function to do this
def classify(input):
    predictions = vec_clf.predict(input)[0]
    probs = vec_clf.predict_proba(input)
    m = np.max(probs)
    return predictions, m

classify('my favorite sport is probably baseball')
It returns a completely different and very wrong result, with a different confidence and class label. Why would it do this?
In the first attempt, you are passing a list of strings to model.predict_proba and model.predict, which is what is expected. In the latter attempt, you are passing a single string. Instead, pass a list of strings:
classify(['my favorite sport is probably baseball'])
Or wrap input in a list inside your function:
def classify(input):
    input = [input]
    predictions = vec_clf.predict(input)[0]
    probs = vec_clf.predict_proba(input)
    m = np.max(probs)
    return predictions, m
What is going on when you pass a bare string is that each individual character is being interpreted as a separate document. So, try just doing:
vec_clf.predict('my favorite sport is probably baseball')
and you will see one prediction per character rather than one for the whole sentence.
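To see it even more concretely, comparing the shapes of the probability outputs makes the difference obvious (a sketch, assuming vec_clf is your fitted vectorizer + classifier pipeline; n_classes is whatever your model was trained on):

probs = vec_clf.predict_proba('my favorite sport is probably baseball')
print(probs.shape)    # (38, n_classes): one row per character of the string
probs = vec_clf.predict_proba(['my favorite sport is probably baseball'])
print(probs.shape)    # (1, n_classes): one row for the single document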