NLP for a multi-feature data set using TensorFlow - Python
I am just a beginner in this subject. I have tested some neural networks for image recognition as well as NLP for sequence classification, and this second topic is the one that interests me.
Using
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'some test sentence',
    'and the second sentence'
]
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sentences = tokenizer.texts_to_sequences(sentences)
will result in an array of size [n, 1], where n is the number of words in the sentence. Assuming I have implemented padding correctly, each training example in the set will be of size [n, 1], where n is the maximum sentence length.
That prepared training set I can then pass into Keras model.fit.
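For reference, the padding step mentioned above could look something like this (a minimal sketch; max_sentence_length is just a placeholder for whatever maximum length is chosen):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad every tokenized sentence to the same length so the training set becomes a dense [num_examples, max_sentence_length] array.
padded_sentences = pad_sequences(sentences, maxlen=max_sentence_length, padding='post')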
But what about when I have multiple features in my data set?
Let's say I would like to build an event prioritization algorithm, and my data structure would look like:
[event_description, event_category, event_location, label]
Trying to tokenize such an array would result in an [n, m] matrix, where n is the maximum sentence length and m is the number of features.
How should I prepare such a dataset so that a model can be trained on it?
Would this approach be OK?
import numpy

# Going through the training set to split the features into separate arrays
training_sentence = []
training_category = []
training_location = []
training_labels = []

for data in dataset:
    training_sentence.append(data['event_description'])
    training_category.append(data['event_category'])
    training_location.append(data['event_location'])
    training_labels.append(data['label'])

# Fit the tokenizer on each feature array, then convert each array to sequences
tokenizer.fit_on_texts(training_sentence)
tokenizer.fit_on_texts(training_category)
tokenizer.fit_on_texts(training_location)

sequences = tokenizer.texts_to_sequences(training_sentence)
categories = tokenizer.texts_to_sequences(training_category)
locations = tokenizer.texts_to_sequences(training_location)

# Concatenating the feature arrays into one training array
training_example = numpy.concatenate([sequences, categories, locations])

# Omitting model definition, training the model
model.fit(training_example, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
I haven't tested it yet. I just want to make sure that I understand everything correctly and that my assumptions are correct.
Is this a correct approach to building an NLP model with a neural network?
I know of two common ways to manage multiple input sequences, and your approach lands somewhere between them.
One approach is to design a multi-input model with each of your text columns as a different input. They can share the vocabulary and/or embedding layer later, but for now you still need a distinct input sub-model for each of description, category, etc.
Each of these becomes an input to the network, using the Model(inputs=[...], outputs=rest_of_nn) syntax. You will need to design rest_of_nn so it can take multiple inputs. This can be as simple as your current concatenation, or you could use additional layers to do the synthesis.
It could look something like this:
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Flatten, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.text import Tokenizer

# Build separate vocabularies. This could be shared.
desc_tokenizer = Tokenizer()
desc_tokenizer.fit_on_texts(training_sentence)
desc_vocab_size = len(desc_tokenizer.word_index) + 1  # +1 because Tokenizer indices start at 1

categ_tokenizer = Tokenizer()
categ_tokenizer.fit_on_texts(training_category)
categ_vocab_size = len(categ_tokenizer.word_index) + 1

# Inputs.
desc = Input(shape=(desc_maxlen,))
categ = Input(shape=(categ_maxlen,))

# Input encodings, opting for different embeddings.
# Descriptions go through an LSTM as a demo of extra processing.
embedded_desc = Embedding(desc_vocab_size, desc_embed_size, input_length=desc_maxlen)(desc)
encoded_desc = LSTM(categ_embed_size, return_sequences=True)(embedded_desc)
encoded_categ = Embedding(categ_vocab_size, categ_embed_size, input_length=categ_maxlen)(categ)

# Rest of the NN, which knows how to put everything together to get an output.
merged = concatenate([encoded_desc, encoded_categ], axis=1)
rest_of_nn = Dense(hidden_size, activation='relu')(merged)
rest_of_nn = Flatten()(rest_of_nn)
rest_of_nn = Dense(output_size, activation='softmax')(rest_of_nn)

# Create the model, assuming some sort of classification problem.
model = Model(inputs=[desc, categ], outputs=rest_of_nn)
model.compile(optimizer='adam', loss=K.categorical_crossentropy)
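For completeness, fitting this multi-input model means passing one padded array per input, in the same order as inputs=[desc, categ]. A rough sketch (the padded-array and label variable names here are placeholders, not part of the code above):

model.fit([desc_train_padded, categ_train_padded], train_labels,
          epochs=num_epochs,
          validation_data=([desc_test_padded, categ_test_padded], test_labels))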
The second approach is to concatenate all of your data before encoding it, and then treat everything as a more standard single-sequence problem after that. It is common to use a unique token to separate or define the different fields, similar to BOS and EOS for the beginning and end of the sequence.
It would look something like this:
XXBOS XXDESC This event will be fun. XXCATEG leisure XXLOC Seattle, WA XXEOS
You can also do end tags for the fields like DESCXX, omit the BOS and EOS tokens, and generally mix and match however you want. You can even use this to combine some of your input sequences, but then use a multi-input model as above to merge the rest.
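A rough sketch of this field-tagging idea (the tag names are just a convention you invent, not a library feature; they are lower-cased here because the Keras Tokenizer lower-cases text by default):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def combine_fields(desc, categ, loc):
    # Join the fields into one tagged string per example.
    return 'xxbos xxdesc {} xxcateg {} xxloc {} xxeos'.format(desc, categ, loc)

combined = [combine_fields(d, c, l)
            for d, c, l in zip(training_sentence, training_category, training_location)]

tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(combined)
combined_padded = pad_sequences(tokenizer.texts_to_sequences(combined), padding='post')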
Speaking of mixing and matching, you also have the option to treat some of your inputs directly as an embedding. Low-cardinality fields like category and location do not need to be tokenized, and can be embedded directly without any need to split into tokens. That is, they don't need to be a sequence.
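For example, if you map each category to a single integer ID yourself, the category input can be a length-1 input feeding an Embedding directly. A small sketch (num_categories and categ_embed_size are placeholders, and the category-to-ID mapping is assumed to be built separately):

categ_id = Input(shape=(1,))                                   # one integer category ID per example
embedded_categ = Embedding(num_categories, categ_embed_size)(categ_id)
encoded_categ = Flatten()(embedded_categ)                      # shape (batch, categ_embed_size)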
If you are looking for a reference, I enjoyed this paper on Large Scale Product Categorization using Structured and Unstructured Attributes. It tests all or most of the ideas I have just outlined, on real data at scale.
Related
Azure OpenAI model - How to create our own knowledge base and train it for a bot [duplicate]
I'm using customized text with 'Prompt' and 'Completion' to train a new model. Here's the tutorial I used to create a customized model from my data: beta.openai.com/docs/guides/fine-tuning/advanced-usage. However, even after training the model and sending prompt text to it, I'm still getting generic results which are not always suitable for me. How can I make sure the completion results for my prompts come only from the text I used for the model and not from the generic OpenAI models? Can I use some flags to eliminate results from the generic models?
Wrong goal: OpenAI API should answer from the fine-tuning dataset if the prompt is similar to the one from the fine-tuning dataset

This is completely the wrong logic. Forget about fine-tuning. As stated on the official OpenAI website:

Fine-tuning lets you get more out of the models available through the API by providing:
- Higher quality results than prompt design
- Ability to train on more examples than can fit in a prompt
- Token savings due to shorter prompts
- Lower latency requests

Fine-tuning is not about answering with a specific answer from the fine-tuning dataset. Fine-tuning helps the model gain more knowledge, but it has nothing to do with how the model answers. Why? The answer we get from the fine-tuned model is based on all of its knowledge (i.e., fine-tuned model knowledge = default knowledge + fine-tuning knowledge).

Although GPT-3 models have a lot of general knowledge, sometimes we want the model to answer with a specific answer (i.e., a "fact").

Correct goal: Answer with a "fact" when asked about a "fact", otherwise answer with the OpenAI API

Note: For better (visual) understanding, the following code was run and tested in Jupyter.

STEP 1: Create a .csv file with "facts"

To keep things simple, let's add two companies (i.e., ABC and XYZ), each with content. The content in our case will be a one-sentence description of the company.

companies.csv

Run print_dataframe.ipynb to print the dataframe.

print_dataframe.ipynb

import pandas as pd

df = pd.read_csv('companies.csv')
df

We should get the following output:

STEP 2: Calculate an embedding vector for every "fact"

An embedding is a vector of numbers that helps us understand how semantically similar or different two texts are. The closer two embeddings are to each other, the more similar their contents are (source).

Let's test the Embeddings endpoint first. Run get_embedding.ipynb with the input This is a test. Note: in the case of the Embeddings endpoint, the parameter prompt is called input.

get_embedding.ipynb

import openai

openai.api_key = '<OPENAI_API_KEY>'

def get_embedding(model: str, text: str) -> list[float]:
    result = openai.Embedding.create(
        model = model,
        input = text
    )
    return result['data'][0]['embedding']

print(get_embedding('text-embedding-ada-002', 'This is a test'))

We should get the following output:

What we see in the screenshot above is This is a test as an embedding vector. More precisely, we get a 1536-dimensional embedding vector (i.e., there are 1536 numbers inside). You are probably familiar with a 3-dimensional space (i.e., X, Y, Z). Well, this is a 1536-dimensional space, which is very hard to imagine.

There are two things we need to understand at this point:

- Why do we need to transform text into an embedding vector (i.e., numbers)? Because later on, we can compare embedding vectors and figure out how similar the two texts are. We can't compare texts as such.
- Why are there exactly 1536 numbers inside the embedding vector? Because the text-embedding-ada-002 model has an output dimension of 1536. It's pre-defined.

Now we can create an embedding vector for each "fact". Run get_all_embeddings.ipynb.
get_all_embeddings.ipynb

import openai
from openai.embeddings_utils import get_embedding
import pandas as pd

openai.api_key = '<OPENAI_API_KEY>'

df = pd.read_csv('companies.csv')
df['embedding'] = df['content'].apply(lambda x: get_embedding(x, engine = 'text-embedding-ada-002'))
df.to_csv('companies_embeddings.csv')

The code above will take the first company (i.e., x), get its 'content' (i.e., its "fact") and apply the function get_embedding using the text-embedding-ada-002 model. It will save the embedding vector of the first company in a new column named 'embedding'. Then it will take the second company, the third company, the fourth company, etc. At the end, the code will automatically generate a new .csv file named companies_embeddings.csv. Saving embedding vectors locally (i.e., in a .csv file) means we don't have to call the OpenAI API every time we need them. We calculate the embedding vector for a given "fact" once, and that's it.

Run print_dataframe_embeddings.ipynb to print the dataframe with the new column named 'embedding'.

print_dataframe_embeddings.ipynb

import pandas as pd
import numpy as np

df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df

We should get the following output:

STEP 3: Calculate an embedding vector for the input and compare it with the embedding vectors from companies_embeddings.csv using cosine similarity

We need to calculate an embedding vector for the input so that we can compare the input with a given "fact" and see how similar these two texts are. Actually, we compare the embedding vector of the input with the embedding vector of the "fact". Then we compare the input with the second "fact", the third "fact", the fourth "fact", etc.

Run get_cosine_similarity.ipynb.

get_cosine_similarity.ipynb

import openai
from openai.embeddings_utils import cosine_similarity
import pandas as pd
import numpy as np  # needed for np.array below

openai.api_key = '<OPENAI_API_KEY>'

my_model = 'text-embedding-ada-002'
my_input = '<INSERT_INPUT>'

def get_embedding(model: str, text: str) -> list[float]:
    result = openai.Embedding.create(
        model = my_model,
        input = my_input
    )
    return result['data'][0]['embedding']

input_embedding_vector = get_embedding(my_model, my_input)

df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, input_embedding_vector))
df

The code above will take the input and compare it with the first fact. It will save the calculated similarity of the two in a new column named 'similarity'. Then it will take the second fact, the third fact, the fourth fact, etc.

If my_input = 'Tell me something about company ABC':
If my_input = 'Tell me something about company XYZ':
If my_input = 'Tell me something about company Apple':

We can see that when we give Tell me something about company ABC as an input, it's most similar to the first "fact". When we give Tell me something about company XYZ as an input, it's most similar to the second "fact". Whereas, if we give Tell me something about company Apple as an input, it's the least similar to either of these two "facts".

STEP 4: Answer with the most similar "fact" if the similarity is above our threshold, otherwise answer with the OpenAI API

Let's set our similarity threshold to >= 0.9. The code below should answer with the most similar "fact" if the similarity is >= 0.9, otherwise answer with the OpenAI API. Run get_answer.ipynb.
get_answer.ipynb

# Imports
import openai
from openai.embeddings_utils import cosine_similarity
import pandas as pd
import numpy as np

# Insert your API key
openai.api_key = '<OPENAI_API_KEY>'

# Insert OpenAI text embedding model and input
my_model = 'text-embedding-ada-002'
my_input = '<INSERT_INPUT>'

# Calculate embedding vector for the input using OpenAI Embeddings endpoint
def get_embedding(model: str, text: str) -> list[float]:
    result = openai.Embedding.create(
        model = my_model,
        input = my_input
    )
    return result['data'][0]['embedding']

# Save embedding vector of the input
input_embedding_vector = get_embedding(my_model, my_input)

# Calculate similarity between the input and "facts" from companies_embeddings.csv file which we created before
df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, input_embedding_vector))

# Find the highest similarity value in the dataframe column 'similarity'
highest_similarity = df['similarity'].max()

# If the highest similarity value is equal or higher than 0.9 then print the 'content' with the highest similarity
if highest_similarity >= 0.9:
    fact_with_highest_similarity = df.loc[df['similarity'] == highest_similarity, 'content']
    print(fact_with_highest_similarity)
# Else pass input to the OpenAI Completions endpoint
else:
    response = openai.Completion.create(
        model = 'text-davinci-003',
        prompt = my_input,
        max_tokens = 30,
        temperature = 0
    )
    content = response['choices'][0]['text'].replace('\n', '')
    print(content)

If my_input = 'Tell me something about company ABC' and the threshold is >= 0.9, we should get the following answer from companies_embeddings.csv:

If my_input = 'Tell me something about company XYZ' and the threshold is >= 0.9, we should get the following answer from companies_embeddings.csv:

If my_input = 'Tell me something about company Apple' and the threshold is >= 0.9, we should get the following answer from the OpenAI API:
Predict over a whole dataset using Transformers
I'm trying to do zero-shot classification over a dataset with 5000 records. Right now I'm using a normal Python loop, but it is going painfully slow. Is there a way to speed up the process using Transformers or Datasets structures? This is how my code looks right now:

classifier = pipeline("zero-shot-classification", model='cross-encoder/nli-roberta-base')

# Create prediction list
candidate_labels = ["Self-direction: action", "Achievement", "Security: personal", "Security: societal", "Benevolence: caring", "Universalism: concern"]

predictions = []
for index, row in reduced_dataset.iterrows():
    res = classifier(row["text"], candidate_labels)
    partial_prediction = []
    for score in res["scores"]:
        if score >= 0.5:
            partial_prediction.append(1)
        else:
            partial_prediction.append(0)
    if index % 100 == 0:
        print(index)
    predictions.append(partial_prediction)
It is always more efficient to process sentences in batches that can be parallelized. According to the documentation, you can provide a list (or, more precisely, an Iterable) of sentences instead of a single input sentence, and the pipeline will automatically take care of all the hassle connected with batching (padding sentences to the same length, estimating a batch size that fits in memory, etc.) and return an Iterable of predictions. The documentation even recommends using dataset objects as inputs to the pipelines.
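A minimal sketch of what that could look like here, assuming reduced_dataset is a pandas DataFrame with a "text" column (the batch_size value is just an example to tune):

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model='cross-encoder/nli-roberta-base')
candidate_labels = ["Self-direction: action", "Achievement", "Security: personal",
                    "Security: societal", "Benevolence: caring", "Universalism: concern"]

texts = reduced_dataset["text"].tolist()

# Passing the whole list lets the pipeline handle batching internally.
predictions = []
for res in classifier(texts, candidate_labels, batch_size=16):
    predictions.append([1 if score >= 0.5 else 0 for score in res["scores"]])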
"... has insufficient rank for batching." What is the problem with this 3 line code?
This is my first question here. I've been wanting to create a dataset with the popular IMDb dataset for learning purposes. The directories are as follows: .../train/pos/ and .../train/neg/. I created a function which merges the text files with their labels, and I am getting an error. I need your help to debug!

def datasetcreate(filepath, label):
    filepaths = tf.data.Dataset.list_files(filepath)
    return tf.stack([tf.data.Dataset.from_tensor_slices((_, tf.constant(label, dtype='int32'))) for _ in tf.data.TextLineDataset(filepaths)])

datasetcreate(['aclImdb/train/pos/*.txt'], 1)

And this is the error I'm getting:

ValueError: Value tf.Tensor(b'An American in Paris was, in many ways, the ultimate.....dancers of all time.', shape=(), dtype=string) has insufficient rank for batching.

Why does this happen and what can I do to get rid of this? Thanks.
Your code has two problems:

First, the way you load your TextLineDatasets, your loaded tensors contain string objects, which have an empty shape associated with them, i.e. a rank of zero. (The rank of a tensor is the length of its shape property.)

Secondly, you are trying to stack two tensors with different ranks, which would throw another error, because a sentence (a sequence of tokens) has a rank of 1 while the label, as a scalar, has a rank of 0.

If you just need the dataset, I recommend using the TensorFlow Datasets package, which has many ready-to-use datasets available. If you want to solve your particular problem, one way to fix your data pipeline is by using the Dataset.interleave and Dataset.zip functions.

# load positive sentences
filepaths = list(tf.data.Dataset.list_files('aclImdb/train/pos/*.txt'))
sentences_ds = tf.data.Dataset.from_tensor_slices(filepaths)
sentences_ds = sentences_ds.interleave(lambda text_file: tf.data.TextLineDataset(text_file))
sentences_ds = sentences_ds.map(lambda text: tf.strings.split(text))

# dataset for labels, create 1 label per file
labels = tf.constant(1, dtype="int32", shape=(len(filepaths),))
label_ds = tf.data.Dataset.from_tensor_slices(labels)

# combine text with label datasets
dataset = tf.data.Dataset.zip((sentences_ds, label_ds))
print(list(dataset.as_numpy_iterator()))

First, you use the interleave function to combine multiple text datasets into one dataset. Next, you use tf.strings.split to split each text into its tokens. Then, you create a dataset for your positive labels. Finally, you combine the two datasets using zip.

IMPORTANT: To train/run any DL models on your dataset, you will likely need further pre-processing for your sentences, e.g. building a vocabulary and training word embeddings.
How to create my own dataset for Keras model.fit() using TensorFlow (Python)?
I want to train a simple classification neural network which can classify the data into 2 types, i.e. true or false. I have 29 data along with respective labels available with me. I want to parse this data to form a dataset which can be fed into model.fit() to train the neural network. Please suggest me how can I arrange the data with their respective labels. What to use, whether lists, dictionary, array? There are values of 2 fingerprints separated by '$' sign and whether they match or not (i.e. true or false) is separated by another '$' sign. A Fingerprint has 63 features separated by ','(comma) sign. So, Each line has the data of 2 fingerprints and true/false data. I have below data with me in following format: File Name : thumb_and_index.txt 239,1,255,255,255,255,2,0,130,3,1,105,24,152,0,192,126,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,192,0,192,0,0,0,0,0,0,0,147,18,19,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,101,22,154,0,240,30,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,0,0,0,0,0,0,0,0,0,0,0,71,150,212,$true 239,1,255,255,255,255,2,0,130,3,1,82,23,146,0,128,126,0,14,0,6,0,6,0,2,0,0,0,0,0,2,0,2,0,2,0,2,0,2,0,6,128,6,192,14,224,30,255,254,0,0,0,0,0,0,207,91,180,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,81,28,138,0,241,254,128,6,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,128,2,128,2,192,6,224,6,224,62,0,0,0,0,0,0,0,0,0,0,0,0,13,62,$true 239,1,255,255,255,255,2,0,130,3,1,92,29,147,0,224,0,192,0,192,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,192,0,192,0,224,0,224,2,240,2,248,6,255,14,76,16,0,0,0,0,19,235,73,181,0,0,0,0,$239,192,255,255,255,255,2,0,130,3,1,0,0,0,0,248,30,240,14,224,0,224,0,128,0,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,6,128,14,192,14,252,30,0,0,0,0,0,0,0,0,0,0,0,0,158,46,$false 239,1,255,255,255,255,2,0,130,3,1,0,0,0,0,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,0,0,0,0,0,0,217,85,88,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,90,27,135,0,252,254,224,126,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,190,148,$false 239,1,255,255,255,255,2,0,130,3,1,89,22,129,0,129,254,128,254,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,6,0,6,128,14,192,14,224,14,0,0,0,0,0,0,20,20,43,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,91,17,134,0,0,126,0,30,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,6,0,6,0,30,192,62,224,126,224,254,0,0,0,0,0,0,0,0,0,0,0,0,138,217,$true 239,1,255,255,255,255,2,0,130,3,1,71,36,143,0,128,254,0,14,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,6,80,18,0,0,0,0,153,213,11,95,83,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,94,30,140,0,129,254,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,192,6,0,0,0,0,0,0,0,0,0,0,0,0,54,13,$true 239,1,255,255,255,255,2,0,130,3,1,66,42,135,0,255,254,1,254,0,14,0,6,0,6,0,6,0,6,0,6,0,2,0,2,0,2,0,2,0,2,0,2,0,6,0,6,0,6,0,0,0,0,0,0,225,165,64,152,172,88,0,0,$239,1,255,255,255,255,2,0,130,3,1,62,29,137,0,255,254,249,254,240,6,224,2,224,0,224,0,224,0,224,0,224,0,224,0,224,0,240,0,240,0,240,0,240,0,240,0,240,2,0,0,0,0,0,0,0,0,0,0,0,0,0,98,$true 239,1,255,255,255,255,2,0,130,3,1,83,31,142,0,255,254,128,254,0,30,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,128,2,192,2,192,2,192,2,192,6,0,0,0,0,0,0,146,89,117,12,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,84,14,154,0,0,2,0,2,0,2,0,2,0,2,0,6,0,14,128,30,192,62,255,254,255,254,255,254,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,31,$false 
239,1,255,255,255,255,2,0,130,3,1,66,41,135,0,255,254,248,62,128,30,0,14,0,14,0,14,0,14,0,14,0,14,0,6,0,6,0,6,0,14,0,14,0,14,192,14,224,14,0,0,0,0,0,0,105,213,155,107,95,23,0,0,$239,1,255,255,255,255,2,0,130,3,1,61,33,133,0,255,254,255,254,224,62,192,6,192,6,192,6,192,6,192,6,192,6,224,6,224,6,224,6,224,6,224,6,224,6,224,6,224,6,0,0,0,0,0,0,0,0,0,0,0,0,0,62,$false 239,1,255,255,255,255,2,0,130,3,1,88,31,119,0,0,14,0,14,0,6,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,133,59,150,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,97,21,137,0,128,14,0,6,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,6,0,0,0,0,0,0,0,0,0,0,0,80,147,210,$true 239,1,255,255,255,255,2,0,130,3,1,85,21,137,0,224,14,192,6,192,6,128,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,128,14,192,30,224,126,224,254,0,0,0,0,0,0,79,158,178,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,89,25,134,0,240,6,128,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,128,2,128,2,192,2,192,6,224,6,240,14,240,30,0,0,0,0,0,0,0,0,0,0,0,0,72,31,$true 239,1,255,255,255,255,2,0,130,3,1,90,25,128,0,241,254,0,30,0,6,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,6,0,6,192,14,0,0,0,0,0,0,225,153,189,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,96,12,153,0,192,14,128,6,128,6,128,6,0,6,128,2,128,2,128,2,128,6,128,6,192,14,240,30,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,18,$false 239,1,255,255,255,255,2,0,130,3,1,96,22,142,0,255,254,254,14,128,2,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,192,2,0,0,0,0,0,0,18,25,100,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,76,24,145,0,224,2,192,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,224,2,240,126,255,254,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,145,$false 239,1,255,255,255,255,2,0,130,3,1,71,33,117,0,129,254,0,30,0,14,0,14,0,6,0,6,0,2,0,2,0,6,0,6,0,6,0,6,0,6,128,14,192,14,240,30,240,254,0,0,0,0,0,0,235,85,221,57,17,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,76,31,112,0,255,254,0,62,0,62,0,62,0,14,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,128,14,224,62,0,0,0,0,0,0,0,0,0,0,0,0,30,170,$true 239,1,255,255,255,255,2,0,130,3,1,64,29,117,0,128,30,0,30,0,30,0,14,0,6,0,6,0,6,0,6,0,6,0,14,0,14,0,14,128,30,192,30,224,62,240,254,255,254,0,0,0,0,0,0,99,80,119,149,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,72,18,132,0,128,2,0,0,0,0,128,0,128,0,128,0,128,0,192,2,224,2,240,14,252,14,255,254,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,14,$false 239,1,255,255,255,255,2,0,130,3,1,82,16,132,0,255,254,255,254,255,254,240,30,224,14,224,14,192,6,192,6,192,2,192,2,192,2,192,2,192,2,192,2,192,1,224,2,240,6,0,0,0,0,0,0,215,21,0,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,85,23,130,0,240,30,192,14,128,14,128,6,128,2,128,2,128,2,128,2,128,2,128,0,192,0,192,2,192,2,224,2,224,6,240,6,248,30,0,0,0,0,0,0,0,0,0,0,0,0,0,62,$true 239,1,255,255,255,255,2,0,130,3,1,100,28,141,0,255,254,255,254,224,14,192,14,192,6,192,2,128,2,128,2,128,2,0,2,0,2,0,2,0,2,0,6,0,6,0,6,192,14,0,0,0,0,0,0,42,88,87,169,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,95,31,134,0,255,254,240,254,224,0,192,0,192,0,192,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,192,2,192,6,0,0,0,0,0,0,0,0,0,0,0,0,0,182,$true 239,1,255,255,255,255,2,0,130,3,1,88,35,121,0,255,14,240,6,224,7,192,2,192,2,192,2,192,2,192,2,192,2,192,2,192,2,224,2,224,2,224,2,224,2,224,2,224,6,0,0,0,0,0,0,36,81,48,225,153,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,81,43,112,0,252,62,248,14,224,2,192,2,192,2,192,0,192,0,192,0,192,0,192,0,192,0,192,0,224,0,224,2,224,2,224,2,224,6,0,0,0,0,0,0,0,0,0,0,0,0,0,76,$true 
239,1,255,255,255,255,2,0,130,3,1,103,24,144,0,255,254,192,14,192,6,128,2,128,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,6,128,6,128,6,192,30,224,254,0,0,0,0,0,0,19,82,111,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,98,11,149,0,255,2,255,0,252,0,240,0,240,0,240,0,248,0,248,0,248,0,252,0,254,0,254,2,254,30,254,30,254,30,254,30,254,30,0,0,0,0,0,0,0,0,0,0,0,0,0,114,$false 239,1,255,255,255,255,2,0,130,3,1,92,23,123,0,255,254,255,30,252,6,240,2,224,0,192,0,192,0,192,0,224,0,224,0,224,0,224,2,224,2,224,2,224,2,224,6,224,6,0,0,0,0,0,0,35,161,251,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,52,37,125,0,255,254,255,254,224,254,192,30,192,14,128,14,128,14,128,14,128,14,128,14,128,14,128,14,128,6,0,2,0,2,0,2,192,2,0,0,0,0,0,0,0,0,0,0,0,0,0,110,$false 239,1,255,255,255,255,2,0,130,3,1,103,19,143,0,255,254,254,254,0,126,0,126,0,126,0,62,0,62,0,126,0,126,0,126,0,126,0,126,0,126,0,126,0,254,0,254,0,254,0,0,0,0,0,0,38,168,0,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,90,30,141,0,255,254,193,254,128,62,0,6,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,6,0,254,0,0,0,0,0,0,0,0,0,0,0,0,53,211,$true 239,1,255,255,255,255,2,0,130,3,1,93,34,137,0,255,254,225,254,192,14,192,2,192,2,192,2,192,2,192,0,192,0,192,0,192,0,192,0,192,0,224,2,224,2,240,6,240,14,0,0,0,0,0,0,101,4,252,164,28,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,88,31,140,0,255,254,192,62,192,14,192,14,0,6,0,6,0,6,0,6,0,2,0,2,0,2,0,2,128,2,128,6,192,6,224,14,240,30,0,0,0,0,0,0,0,0,0,0,0,0,10,97,$true 239,1,255,255,255,255,2,0,130,3,1,57,50,107,0,248,2,248,0,248,0,224,0,224,0,192,0,192,0,192,0,128,0,128,0,128,0,128,0,192,0,192,0,192,0,192,2,224,2,0,0,0,0,0,0,34,10,146,27,176,73,73,82,$239,1,255,255,255,255,2,0,130,3,1,54,42,111,0,255,254,255,254,254,126,252,6,240,2,224,2,224,2,224,0,224,0,224,0,224,0,224,0,224,0,224,0,224,0,192,0,192,0,0,0,0,0,0,0,0,0,0,0,0,0,0,225,$true 239,1,255,255,255,255,2,0,130,3,1,103,18,142,0,241,254,224,254,128,126,128,126,0,62,0,30,0,30,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,0,0,0,0,0,209,21,0,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,103,10,139,0,255,254,255,254,255,254,225,254,192,254,192,254,192,126,128,62,0,30,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,163,$true 239,1,255,255,255,255,2,0,130,3,1,85,21,132,0,248,2,248,2,248,0,240,0,240,0,240,0,240,0,240,0,240,0,240,0,248,0,248,0,252,0,252,0,252,0,254,2,255,6,0,0,0,0,0,0,94,23,110,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,76,26,133,0,129,254,128,62,0,62,0,62,0,62,0,62,0,30,0,30,0,30,0,30,0,30,0,30,0,30,0,30,128,30,192,14,224,14,0,0,0,0,0,0,0,0,0,0,0,0,222,36,$true 239,1,255,255,255,255,2,0,130,3,1,87,28,141,0,255,254,255,254,224,254,224,126,224,126,0,14,0,2,0,2,0,2,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,2,0,0,0,0,0,0,143,231,78,148,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,89,30,139,0,255,254,248,254,240,30,224,14,224,14,192,6,192,2,128,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,26,213,$true 239,1,255,255,255,255,2,0,130,3,1,93,25,136,0,255,254,193,254,0,254,0,62,0,30,0,30,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,0,0,0,0,0,148,210,91,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,95,23,145,0,254,254,252,30,240,2,224,0,224,0,224,0,192,0,192,0,192,0,192,6,192,6,192,6,192,6,192,6,192,6,224,6,224,14,0,0,0,0,0,0,0,0,0,0,0,0,0,30,$false 
239,1,255,255,255,255,2,0,130,3,1,85,27,138,0,255,254,240,126,224,30,192,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,30,0,30,0,30,192,62,224,62,0,0,0,0,0,0,85,17,74,101,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,105,19,144,0,192,254,128,126,0,62,0,30,128,30,128,30,128,14,192,14,192,14,192,14,224,14,224,14,240,14,240,14,248,14,254,30,255,30,0,0,0,0,0,0,0,0,0,0,0,0,0,254,$false 239,1,255,255,255,255,2,0,130,3,1,86,37,116,0,255,254,254,14,252,6,248,2,240,0,240,0,224,0,192,0,192,0,128,0,0,0,0,2,0,2,0,2,0,2,0,6,0,6,0,0,0,0,0,0,94,157,90,28,219,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,99,26,130,0,255,254,248,14,240,2,224,0,192,0,192,0,192,0,128,0,192,0,192,0,192,0,192,0,224,0,240,2,248,6,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,213,$true I have used this code trying to parse the data: import tensorflow as tf import os import array as arr import numpy as np import json os.environ["TF_CPP_MIN_LOG_LEVEL"]="2" f= open("thumb_and_index.txt","r") dataset = [] if f.mode == 'r': contents =f.read() #list of lines lines = contents.splitlines() print("No. of lines : "+str(len(lines))) for line in lines: words = line.split(',') mainlist = [] list = [] flag = 0 for word in words: print("word : " + word) if '$' in word: if flag == 1: mainlist.append(list) mainlist.append(word[1:]) dataset.append(mainlist) else: mainlist.append(list) del list[0:len(list)] list.append(int(word[1:])) flag = flag + 1 else: list.append(int(word)) print(json.dumps(dataset, indent = 4)) I want to feed the parsed data into model.fit() using keras in tensorflow(python). Also I want to ask about the neural network. How many layers and nodes should I keep in my neural network? Suggest a starting point.
There are plenty of ways to do that (formatting the data). You can create a 2D matrix for the data that has 62 columns for the features, plus another array that holds the labels for that data (X_data, Y_data). You can also use pandas to create dataframes for the data (same as arrays, but better for displaying and visualizing the data).

Example of reading the text file into a pandas dataframe:

import pandas
df = pandas.read_table('./input/dists.txt', delim_whitespace=True, names=('A', 'B', 'C'))

Split the data into X and Y, then fit it in your model.

As for the size of the hidden layers in your neural network, it's well known that the more layers you add, the more accurate the results you get (without considering overfitting), so that depends on your data. I suggest you start with a Sequential model with layers as follows (62 -> 2048 -> 1024 -> 512 -> 128 -> 64 -> sigmoid), as sketched below.
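A minimal sketch of that suggestion in Keras (layer sizes taken from the answer above; X_data and Y_data are the hypothetical feature matrix and 0/1 label array just described, so the input size is inferred from X_data rather than hard-coded):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(2048, activation='relu', input_shape=(X_data.shape[1],)),
    Dense(1024, activation='relu'),
    Dense(512, activation='relu'),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid'),   # binary true/false output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_data, Y_data, epochs=10, validation_split=0.2)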
The best approach, especially assuming that the dataset is large, is to use the tf.data Dataset API. There's a CSV reader built right in. The Dataset API provides all the functionality you need to preprocess the dataset, it provides built-in multi-core processing, and quite a bit more. Once you have the dataset built, Keras will accept it as an input directly, so you can call model.fit(my_dataset, ...). The structure of the Dataset API takes a little learning, but it's well worth it. Here's the primary guide, with lots of examples: https://www.tensorflow.org/guide/datasets Scroll down to the section on 'Import CSV data' for pertinent examples. Here's a nice example of using the Dataset API with Keras: How to Properly Combine TensorFlow's Dataset API and Keras?
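A rough sketch of what that could look like for this data, assuming the $-separated file has first been converted to a plain CSV with one numeric column per feature plus an integer label column (the file name, column names, and batch size below are hypothetical):

import tensorflow as tf

NUM_FEATURES = 126  # e.g. 63 features per fingerprint x 2 fingerprints
feature_names = ["f{}".format(i) for i in range(NUM_FEATURES)]

dataset = tf.data.experimental.make_csv_dataset(
    "thumb_and_index.csv",                     # hypothetical pre-converted CSV with no header row
    batch_size=8,
    column_names=feature_names + ["label"],
    label_name="label",
    header=False,
    num_epochs=1,
)

# Stack the per-column tensors into a single [batch, NUM_FEATURES] float tensor for a Dense model.
def pack_features(features, label):
    return tf.cast(tf.stack(list(features.values()), axis=1), tf.float32), label

dataset = dataset.map(pack_features)
# model.fit(dataset, epochs=10)  # Keras accepts a tf.data.Dataset directly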
Finding closest related words using word2vec
My goal is to find the most relevant words given a set of keywords using word2vec. For example, if I have the set of words [girl, kite, beach], I would like the relevant words output from word2vec to be something like: [flying, swimming, swimsuit...]

I understand that word2vec vectorizes a word based on the context of surrounding words. So what I did was use the following function:

most_similar_cosmul([girl, kite, beach])

However, it seems to give out words not very related to the set of keywords:

['charade', 0.30288437008857727]
['kinetic', 0.3002534508705139]
['shells', 0.29911646246910095]
['kites', 0.2987399995326996]
['7-9', 0.2962781488895416]
['showering', 0.2953910827636719]
['caribbean', 0.294752299785614]
['hide-and-go-seek', 0.2939240336418152]
['turbine', 0.2933803200721741]
['teenybopper', 0.29288050532341003]
['rock-paper-scissors', 0.2928623557090759]
['noisemaker', 0.2927709221839905]
['scuba-diving', 0.29180505871772766]
['yachting', 0.2907838821411133]
['cherub', 0.2905363440513611]
['swimmingpool', 0.290039986371994]
['coastline', 0.28998953104019165]
['Dinosaur', 0.2893030643463135]
['flip-flops', 0.28784963488578796]
['guardsman', 0.28728148341178894]
['frisbee', 0.28687697649002075]
['baltic', 0.28405341506004333]
['deprive', 0.28401875495910645]
['surfs', 0.2839275300502777]
['outwear', 0.28376665711402893]
['diverstiy', 0.28341981768608093]
['mid-air', 0.2829524278640747]
['kickboard', 0.28234976530075073]
['tanning', 0.281939834356308]
['admiration', 0.28123530745506287]
['Mediterranean', 0.281186580657959]
['cycles', 0.2807052433490753]
['teepee', 0.28070521354675293]
['progeny', 0.2775532305240631]
['starfish', 0.2775339186191559]
['romp', 0.27724218368530273]
['pebbles', 0.2771730124950409]
['waterpark', 0.27666303515434265]
['tarzan', 0.276429146528244]
['lighthouse', 0.2756190896034241]
['captain', 0.2755546569824219]
['popsicle', 0.2753356397151947]
['Pohoda', 0.2751699686050415]
['angelic', 0.27499720454216003]
['african-american', 0.27493417263031006]
['dam', 0.2747344970703125]
['aura', 0.2740659713745117]
['Caribbean', 0.2739778757095337]
['necking', 0.27346789836883545]
['sleight', 0.2733519673347473]

This is the code I used to train word2vec:

def train(data_filepath, epochs=300, num_features=300, min_word_count=2, context_size=7,
          downsampling=1e-3, seed=1, ckpt_filename=None):
    """
    Train word2vec model
    data_filepath           path of the data file in csv format
    :param epochs:          number of times to train
    :param num_features:    increase to improve generality, more computationally expensive to train
    :param min_word_count:  minimum frequency of word. Word with lower frequency will not be included in training data
    :param context_size:    context window length
    :param downsampling:    reduce frequency for frequent keywords
    :param seed:            make results reproducible for random generator. Same seed means, after training model produces same results.
    :returns                path of the checkpoint after training
    """
    if ckpt_filename == None:
        data_base_filename = os.path.basename(data_filepath)
        data_filename = os.path.splitext(data_base_filename)[0]
        ckpt_filename = data_filename + ".wv.ckpt"

    num_workers = multiprocessing.cpu_count()

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    nltk.download("punkt")
    nltk.download("stopwords")

    print("Training %s ..." % data_filepath)
    sentences = _get_sentences(data_filepath)

    word2vec = w2v.Word2Vec(
        sg=1,
        seed=seed,
        workers=num_workers,
        size=num_features,
        min_count=min_word_count,
        window=context_size,
        sample=downsampling
    )
    word2vec.build_vocab(sentences)
    print("Word2vec vocab length: %d" % len(word2vec.wv.vocab))
    word2vec.train(sentences, total_examples=len(sentences), epochs=epochs)

    return _save_ckpt(word2vec, ckpt_filename)


def _save_ckpt(model, ckpt_filename):
    if not os.path.exists("checkpoints"):
        os.makedirs("checkpoints")
    ckpt_filepath = os.path.join("checkpoints", ckpt_filename)
    model.save(ckpt_filepath)
    return ckpt_filepath


def _get_sentences(data_filename):
    print("Found Data:")
    sentences = []
    print("Reading '{0}'...".format(data_filename))
    with codecs.open(data_filename, "r") as data_file:
        reader = csv.DictReader(data_file)
        for row in reader:
            sentences.append(ast.literal_eval((row["highscores"])))
    print("There are {0} sentences".format(len(sentences)))
    return sentences


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='Train Word2vec model')
    parser.add_argument('data_filepath', help='path to training CSV file.')
    args = parser.parse_args()
    data_filepath = args.data_filepath
    train(data_filepath)

This is a sample of the training data used for word2vec:

22751473,"[""lover"", ""sweetheart"", ""couple"", ""dietary"", ""meal""]"
28738542,"[""mallotus"", ""villosus"", ""shishamo"", ""smelt"", ""dried"", ""fish"", ""spirinchus"", ""lanceolatus""]"
25163686,"[""Snow"", ""Removal"", ""snow"", ""clearing"", ""female"", ""females"", ""woman"", ""women"", ""blower"", ""snowy"", ""road"", ""operate""]"
32837025,"[""milk"", ""breakfast"", ""drink"", ""cereal"", ""eating""]"
23828321,"[""jogging"", ""female"", ""females"", ""lady"", ""woman"", ""women"", ""running"", ""person""]"
22874156,"[""lover"", ""sweetheart"", ""heterosexual"", ""couple"", ""man"", ""and"", ""woman"", ""consulting"", ""hear"", ""listening""]

For prediction, I simply used the most_similar_cosmul function on a set of keywords. I was wondering whether it is possible to find relevant keywords with word2vec. If it is not, then what machine learning model would be more suitable for this? Any insights would be very helpful.
When supplying multiple positive-word examples, like ['girl', 'kite', 'beach'], to most_similar()/most_similar_cosmul(), the vectors for those words will be averaged together first, then a list of words most similar to the average is returned. Those might not be as obviously related to any one of the words as in a simple check of a single word. So: when you try most_similar() (or most_similar_cosmul()) on a single word, what kind of results do you get? Are they words that seem related to the input word, in the way that you care about? If not, you have deeper problems in your setup that should be fixed before trying a multi-word similarity.

Word2Vec gets its usual results from (1) lots of training data and (2) natural-language sentences. With enough data, a typical number of epochs training-passes (and thus the default) is 5. You can sometimes, somewhat, make up for less data by using more epoch iterations or a smaller vector size, but not always. It's not clear how much data you have. Also, your example rows aren't real natural-language sentences – they appear to have had some other preprocessing/reordering applied. That may be hurting rather than helping.

Word-vectors often improve by throwing away more low-frequency words (increasing min_count above the default 5, rather than reducing it to 2). Low-frequency words don't have enough examples to get good vectors – and the few examples they do have, even if repeated over many iterations, tend to be idiosyncratic examples of the words' usage, not the generalizable broad representations that you'd get from many varied examples. And by keeping these doomed-to-be-weak words in the training data, the training of other, more frequent words is interfered with. (When you get a word that you don't think belongs in a most-similar ranking, it may be a rare word that, given its few occurrence contexts, found its way to those coordinates as the least-bad location among plenty of other unhelpful coordinates.)

If you do get good results from single-word checks, but not from the average-of-multiple-words, the results might improve with more and better data or adjusted training parameters – but to achieve that you'd need to more rigorously define what you consider good results. (Your existing list doesn't look that bad to me: it includes many words related to sun/sand/beach activities.) On the other hand, your expectations of Word2Vec may be too high: it may not be that the average of ['girl', 'kite', 'beach'] is necessarily close to those desired words, compared to the individual words themselves, or that may only be achievable with lots of dataset/parameter tweaking.
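As a concrete way to run the single-word sanity check suggested above, assuming word2vec is the trained gensim model from the question's script:

# Check each word individually first...
for word in ['girl', 'kite', 'beach']:
    print(word, word2vec.wv.most_similar(positive=[word], topn=5))

# ...then compare with the multi-word average used in the question.
print(word2vec.wv.most_similar(positive=['girl', 'kite', 'beach'], topn=10))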