scikit-learn multi dimensional features

scikit-learn multi dimensional features - python

i have a question concerning scikit-learn.
Is it possible to merge a multi dimensional feature list to one feature vector.
For example:
I have results from an application analysis and I would like to represent an application with one feature vector.
In case of network traffic, an analysis result looks like the following:
traffic = [
{
"http_body": "http_body_data", "length": 1024
},
{
"http_body2": "http_body_data2", "length": 2048
},
... and many more
]
So each dict in traffic list describes one network activity of a specific application.
I would like to generate a feature vector which contains all these information for one application to be able to generate a model out of the analysis results from a variety of applications.
How can I do this with scikit-learn?
Thank you in advance!

If every application sends response of the same length (e.g. the first have length 1024, the second - 2048 etc.) you can just join all the results into one vector.
For example if the responses in traffic is serialized lists (for example json).
def merge_feature_vector(traffic):
result = []
for id, data in enumerate(traffic):
result.extend(json.loads(data['http_body%s' % id]))
return result
Another approach is to use sklearn.feature_hasher.
For example
def encode_traffic(traffic):
result = {}
for id, data in enumerate(traffic):
result['html_body%s' % id] = data['html_body%s' % id]
return result
...
features = [encode_traffic(traffic) for traffic in train]
h = FeatureHasher(n_features=10)
features = h.fit_transform(features)
features.toarray()
clf = RandomForestClassifier(n_estimators=100)
clf.train(features)
By the way it mostly depends on what you have in http_data

You can't have non-numeric features the value of a feature should be a number.
so what do you do if you have : ip:127.0.0.1 ip:192.168.0.1 ip:220.220.220.220 ?
you create three features : ip_127.0.0.1, ip.192.168.0.1 and ip.220.220.220.220 and set the value of the first one 1 and the other two zero if the value of ip is 127.0.0.1.
If ip:val can have say more than 10 values you just create 10 features for the most common ones and create an ip_other feature and set that for all other samples who have another ip address.

Related

"... has insufficient rank for batching." What is the problem with this 3 line code?

this is my first question here.
I've been wanting to create a dataset with the popular IMDb dataset for learning purpose. The directories are as follows: .../train/pos/ and .../train/neg/ . I created a function which will merge text files with its labels and getting a error. I need your help to debug!
def datasetcreate(filepath, label):
filepaths = tf.data.Dataset.list_files(filepath)
return tf.stack([tf.data.Dataset.from_tensor_slices((_, tf.constant(label, dtype='int32'))) for _ in tf.data.TextLineDataset(filepaths)])
datasetcreate(['aclImdb/train/pos/*.txt'],1)
And this is the error I'm getting:
ValueError: Value tf.Tensor(b'An American in Paris was, in many ways, the ultimate.....dancers of all time.', shape=(), dtype=string) has insufficient rank for batching.
Why does this happen and what can I do to get rid of this? Thanks.

Your code has two problems:
First, the way you load your TextLineDatasets, your loaded tensors contain string objects, which have an empty shape associated, i.e. a rank of zero. The rank of a tensor is the length of the shape property.
Secondly, you are trying to stack two tensors with different rank, which is would throw another error because, a sentence (a sequence of tokens) has a rank of 1 and the label as scalar has a rank of 0.
If you just need the dataset, I recommend to use the Tensorflow Dataset package, which has many ready-to-use datasets available.
If want to solve your particular problem, one way to fix your data pipeline is by using Datasest.interleave and the Dataset.zip functions.
# load positive sentences
filepaths = list(tf.data.Dataset.list_files('aclImdb/train/pos/*.txt'))
sentences_ds = tf.data.Dataset.from_tensor_slices(filepaths)
sentences_ds = sentences_ds.interleave(lambda text_file: tf.data.TextLineDataset(text_file) )
sentences_ds = sentences_ds.map( lambda text: tf.strings.split(text) )
# dataset for labels, create 1 label per file
labels = tf.constant(1, dtype="int32", shape=(len(filepaths)))
label_ds = tf.data.Dataset.from_tensor_slices(labels)
# combine text with label datasets
dataset = tf.data.Dataset.zip( (sentences_ds, label_ds) )
print( list(dataset.as_numpy_iterator() ))
First, you use the interleave function to combine multiple text datasets to one dataset. Next, you use tf.strings.split to split each text to its tokens. Then, you create a dataset for your positive labels. Finally, you combine the two datasets using zip.
IMPORTANT: To train/run any DL models on your dataset, you will likely need further pre-processing for your sentences, e.g. build a vocabulary and train word-embeddings.

NLP for multi feature data set using TensorFlow

I am just a beginner in this subject, I have tested some NN for image recognition as well as using NLP for sequence classification.
This second topic is interesting for me.
using
sentences = [
'some test sentence',
'and the second sentence'
]
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sentences = tokenizer.texts_to_sequences(sentences)
will result with an array of size [n,1] where n is word size in sentence. And assuming I have implemented padding correctly each Training example in set will be size of [n,1] where n is the max sentence length.
that prepared training set I can pass into keras model.fit
what when I have multiple features in my data set?
let's say I would like to build an event prioritization algorithm and my data structure would look like:
[event_description, event_category, event_location, label]
trying to tokenize such array would result in [n,m] matrix where n is maximum word length and m is the feature number
how to prepare such a dataset so a model could be trained on such data?
would this approach be ok:
# Going through training set to get all features into specific ararys
for data in dataset:
training_sentence.append(data['event_description'])
training_category.append(data['event_category'])
training_location.append(data['event_location'])
training_labels.append(data['label'])
# Tokenize each array which contains tokenized value
tokenizer.fit_on_texts(training_sentence)
tokenizer.fit_on_texts(training_category)
tokenizer.fit_on_texts(training_location)
sequences = tokenizer.texts_to_sequences(training_sentence)
categories = tokenizer.texts_to_sequences(training_category)
locations = tokenizer.texts_to_sequences(training_location)
# Concatenating arrays with features into one
training_example = numpy.concatenate([sequences,categories, locations])
#ommiting model definition, training the model
model.fit(training_example, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
I haven't been testing it yet. I just want to make sure if I understand everything correctly and if my assumptions are correct.
Is this a correct approach to create NPL using NN?

I know of two common ways to manage multiple input sequences, and your approach lands somewhere between them.
One approach is to design a multi-input model with each of your text columns as a different input. They can share the vocabulary and/or embedding layer later, but for now you still need a distinct input sub-model for each of description, category, etc.
Each of these becomes an input to the network, using the Model(inputs=[...], outputs=rest_of_nn) syntax. You will need to design rest_of_nn so it can take multiple inputs. This can be as simple as your current concatenation, or you could use additional layers to do the synthesis.
It could look something like this:
# Build separate vocabularies. This could be shared.
desc_tokenizer = Tokenizer()
desc_tokenizer.fit_on_texts(training_sentence)
desc_vocab_size = len(desc_tokenizer.word_index)
categ_tokenizer = Tokenizer()
categ_tokenizer.fit_on_texts(training_category)
categ_vocab_size = len(categ_tokenizer.word_index)
# Inputs.
desc = Input(shape=(desc_maxlen,))
categ = Input(shape=(categ_maxlen,))
# Input encodings, opting for different embeddings.
# Descriptions go through an LSTM as a demo of extra processing.
embedded_desc = Embedding(desc_vocab_size, desc_embed_size, input_length=desc_maxlen)(desc)
encoded_desc = LSTM(categ_embed_size, return_sequences=True)(embedded_desc)
encoded_categ = Embedding(categ_vocab_size, categ_embed_size, input_length=categ_maxlen)(categ)
# Rest of the NN, which knows how to put everything together to get an output.
merged = concatenate([encoded_desc, encoded_categ], axis=1)
rest_of_nn = Dense(hidden_size, activation='relu')(merged)
rest_of_nn = Flatten()(rest_of_nn)
rest_of_nn = Dense(output_size, activation='softmax')(rest_of_nn)
# Create the model, assuming some sort of classification problem.
model = Model(inputs=[desc, categ], outputs=rest_of_nn)
model.compile(optimizer='adam', loss=K.categorical_crossentropy)
The second approach is to concatenate all of your data before encoding it, and then treat everything as a more standard single-sequence problem after that. It is common to use a unique token to separate or define the different fields, similar to BOS and EOS for the beginning and end of the sequence.
It would look something like this:
XXBOS XXDESC This event will be fun. XXCATEG leisure XXLOC Seattle, WA XXEOS
You can also do end tags for the fields like DESCXX, omit the BOS and EOS tokens, and generally mix and match however you want. You can even use this to combine some of your input sequences, but then use a multi-input model as above to merge the rest.
Speaking of mixing and matching, you also have the option to treat some of your inputs directly as an embedding. Low-cardinality fields like category and location do not need to be tokenized, and can be embedded directly without any need to split into tokens. That is, they don't need to be a sequence.
If you are looking for a reference, I enjoyed this paper on Large Scale Product Categorization using Structured and Unstructured Attributes. It tests all or most of the ideas I have just outlined, on real data at scale.

How to create my own dataset for keras model.fit() using Tensorflow(python)?

I want to train a simple classification neural network which can classify the data into 2 types, i.e. true or false.
I have 29 data along with respective labels available with me. I want to parse this data to form a dataset which can be fed into model.fit() to train the neural network.
Please suggest me how can I arrange the data with their respective labels. What to use, whether lists, dictionary, array?
There are values of 2 fingerprints separated by '$' sign and whether they match or not (i.e. true or false) is separated by another '$' sign.
A Fingerprint has 63 features separated by ','(comma) sign.
So, Each line has the data of 2 fingerprints and true/false data.
I have below data with me in following format:
File Name : thumb_and_index.txt
239,1,255,255,255,255,2,0,130,3,1,105,24,152,0,192,126,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,192,0,192,0,0,0,0,0,0,0,147,18,19,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,101,22,154,0,240,30,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,0,0,0,0,0,0,0,0,0,0,0,71,150,212,$true
239,1,255,255,255,255,2,0,130,3,1,82,23,146,0,128,126,0,14,0,6,0,6,0,2,0,0,0,0,0,2,0,2,0,2,0,2,0,2,0,6,128,6,192,14,224,30,255,254,0,0,0,0,0,0,207,91,180,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,81,28,138,0,241,254,128,6,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,128,2,128,2,192,6,224,6,224,62,0,0,0,0,0,0,0,0,0,0,0,0,13,62,$true
239,1,255,255,255,255,2,0,130,3,1,92,29,147,0,224,0,192,0,192,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,192,0,192,0,224,0,224,2,240,2,248,6,255,14,76,16,0,0,0,0,19,235,73,181,0,0,0,0,$239,192,255,255,255,255,2,0,130,3,1,0,0,0,0,248,30,240,14,224,0,224,0,128,0,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,6,128,14,192,14,252,30,0,0,0,0,0,0,0,0,0,0,0,0,158,46,$false
239,1,255,255,255,255,2,0,130,3,1,0,0,0,0,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,0,0,0,0,0,0,217,85,88,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,90,27,135,0,252,254,224,126,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,190,148,$false
239,1,255,255,255,255,2,0,130,3,1,89,22,129,0,129,254,128,254,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,6,0,6,128,14,192,14,224,14,0,0,0,0,0,0,20,20,43,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,91,17,134,0,0,126,0,30,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,6,0,6,0,30,192,62,224,126,224,254,0,0,0,0,0,0,0,0,0,0,0,0,138,217,$true
239,1,255,255,255,255,2,0,130,3,1,71,36,143,0,128,254,0,14,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,6,80,18,0,0,0,0,153,213,11,95,83,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,94,30,140,0,129,254,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,192,6,0,0,0,0,0,0,0,0,0,0,0,0,54,13,$true
239,1,255,255,255,255,2,0,130,3,1,66,42,135,0,255,254,1,254,0,14,0,6,0,6,0,6,0,6,0,6,0,2,0,2,0,2,0,2,0,2,0,2,0,6,0,6,0,6,0,0,0,0,0,0,225,165,64,152,172,88,0,0,$239,1,255,255,255,255,2,0,130,3,1,62,29,137,0,255,254,249,254,240,6,224,2,224,0,224,0,224,0,224,0,224,0,224,0,224,0,240,0,240,0,240,0,240,0,240,0,240,2,0,0,0,0,0,0,0,0,0,0,0,0,0,98,$true
239,1,255,255,255,255,2,0,130,3,1,83,31,142,0,255,254,128,254,0,30,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,128,2,192,2,192,2,192,2,192,6,0,0,0,0,0,0,146,89,117,12,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,84,14,154,0,0,2,0,2,0,2,0,2,0,2,0,6,0,14,128,30,192,62,255,254,255,254,255,254,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,31,$false
239,1,255,255,255,255,2,0,130,3,1,66,41,135,0,255,254,248,62,128,30,0,14,0,14,0,14,0,14,0,14,0,14,0,6,0,6,0,6,0,14,0,14,0,14,192,14,224,14,0,0,0,0,0,0,105,213,155,107,95,23,0,0,$239,1,255,255,255,255,2,0,130,3,1,61,33,133,0,255,254,255,254,224,62,192,6,192,6,192,6,192,6,192,6,192,6,224,6,224,6,224,6,224,6,224,6,224,6,224,6,224,6,0,0,0,0,0,0,0,0,0,0,0,0,0,62,$false
239,1,255,255,255,255,2,0,130,3,1,88,31,119,0,0,14,0,14,0,6,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,133,59,150,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,97,21,137,0,128,14,0,6,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,6,0,0,0,0,0,0,0,0,0,0,0,80,147,210,$true
239,1,255,255,255,255,2,0,130,3,1,85,21,137,0,224,14,192,6,192,6,128,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,128,14,192,30,224,126,224,254,0,0,0,0,0,0,79,158,178,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,89,25,134,0,240,6,128,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,128,2,128,2,192,2,192,6,224,6,240,14,240,30,0,0,0,0,0,0,0,0,0,0,0,0,72,31,$true
239,1,255,255,255,255,2,0,130,3,1,90,25,128,0,241,254,0,30,0,6,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,6,0,6,192,14,0,0,0,0,0,0,225,153,189,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,96,12,153,0,192,14,128,6,128,6,128,6,0,6,128,2,128,2,128,2,128,6,128,6,192,14,240,30,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,18,$false
239,1,255,255,255,255,2,0,130,3,1,96,22,142,0,255,254,254,14,128,2,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,192,2,0,0,0,0,0,0,18,25,100,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,76,24,145,0,224,2,192,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,224,2,240,126,255,254,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,145,$false
239,1,255,255,255,255,2,0,130,3,1,71,33,117,0,129,254,0,30,0,14,0,14,0,6,0,6,0,2,0,2,0,6,0,6,0,6,0,6,0,6,128,14,192,14,240,30,240,254,0,0,0,0,0,0,235,85,221,57,17,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,76,31,112,0,255,254,0,62,0,62,0,62,0,14,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,128,14,224,62,0,0,0,0,0,0,0,0,0,0,0,0,30,170,$true
239,1,255,255,255,255,2,0,130,3,1,64,29,117,0,128,30,0,30,0,30,0,14,0,6,0,6,0,6,0,6,0,6,0,14,0,14,0,14,128,30,192,30,224,62,240,254,255,254,0,0,0,0,0,0,99,80,119,149,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,72,18,132,0,128,2,0,0,0,0,128,0,128,0,128,0,128,0,192,2,224,2,240,14,252,14,255,254,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,14,$false
239,1,255,255,255,255,2,0,130,3,1,82,16,132,0,255,254,255,254,255,254,240,30,224,14,224,14,192,6,192,6,192,2,192,2,192,2,192,2,192,2,192,2,192,1,224,2,240,6,0,0,0,0,0,0,215,21,0,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,85,23,130,0,240,30,192,14,128,14,128,6,128,2,128,2,128,2,128,2,128,2,128,0,192,0,192,2,192,2,224,2,224,6,240,6,248,30,0,0,0,0,0,0,0,0,0,0,0,0,0,62,$true
239,1,255,255,255,255,2,0,130,3,1,100,28,141,0,255,254,255,254,224,14,192,14,192,6,192,2,128,2,128,2,128,2,0,2,0,2,0,2,0,2,0,6,0,6,0,6,192,14,0,0,0,0,0,0,42,88,87,169,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,95,31,134,0,255,254,240,254,224,0,192,0,192,0,192,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,192,2,192,6,0,0,0,0,0,0,0,0,0,0,0,0,0,182,$true
239,1,255,255,255,255,2,0,130,3,1,88,35,121,0,255,14,240,6,224,7,192,2,192,2,192,2,192,2,192,2,192,2,192,2,192,2,224,2,224,2,224,2,224,2,224,2,224,6,0,0,0,0,0,0,36,81,48,225,153,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,81,43,112,0,252,62,248,14,224,2,192,2,192,2,192,0,192,0,192,0,192,0,192,0,192,0,192,0,224,0,224,2,224,2,224,2,224,6,0,0,0,0,0,0,0,0,0,0,0,0,0,76,$true
239,1,255,255,255,255,2,0,130,3,1,103,24,144,0,255,254,192,14,192,6,128,2,128,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,6,128,6,128,6,192,30,224,254,0,0,0,0,0,0,19,82,111,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,98,11,149,0,255,2,255,0,252,0,240,0,240,0,240,0,248,0,248,0,248,0,252,0,254,0,254,2,254,30,254,30,254,30,254,30,254,30,0,0,0,0,0,0,0,0,0,0,0,0,0,114,$false
239,1,255,255,255,255,2,0,130,3,1,92,23,123,0,255,254,255,30,252,6,240,2,224,0,192,0,192,0,192,0,224,0,224,0,224,0,224,2,224,2,224,2,224,2,224,6,224,6,0,0,0,0,0,0,35,161,251,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,52,37,125,0,255,254,255,254,224,254,192,30,192,14,128,14,128,14,128,14,128,14,128,14,128,14,128,14,128,6,0,2,0,2,0,2,192,2,0,0,0,0,0,0,0,0,0,0,0,0,0,110,$false
239,1,255,255,255,255,2,0,130,3,1,103,19,143,0,255,254,254,254,0,126,0,126,0,126,0,62,0,62,0,126,0,126,0,126,0,126,0,126,0,126,0,126,0,254,0,254,0,254,0,0,0,0,0,0,38,168,0,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,90,30,141,0,255,254,193,254,128,62,0,6,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,6,0,254,0,0,0,0,0,0,0,0,0,0,0,0,53,211,$true
239,1,255,255,255,255,2,0,130,3,1,93,34,137,0,255,254,225,254,192,14,192,2,192,2,192,2,192,2,192,0,192,0,192,0,192,0,192,0,192,0,224,2,224,2,240,6,240,14,0,0,0,0,0,0,101,4,252,164,28,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,88,31,140,0,255,254,192,62,192,14,192,14,0,6,0,6,0,6,0,6,0,2,0,2,0,2,0,2,128,2,128,6,192,6,224,14,240,30,0,0,0,0,0,0,0,0,0,0,0,0,10,97,$true
239,1,255,255,255,255,2,0,130,3,1,57,50,107,0,248,2,248,0,248,0,224,0,224,0,192,0,192,0,192,0,128,0,128,0,128,0,128,0,192,0,192,0,192,0,192,2,224,2,0,0,0,0,0,0,34,10,146,27,176,73,73,82,$239,1,255,255,255,255,2,0,130,3,1,54,42,111,0,255,254,255,254,254,126,252,6,240,2,224,2,224,2,224,0,224,0,224,0,224,0,224,0,224,0,224,0,224,0,192,0,192,0,0,0,0,0,0,0,0,0,0,0,0,0,0,225,$true
239,1,255,255,255,255,2,0,130,3,1,103,18,142,0,241,254,224,254,128,126,128,126,0,62,0,30,0,30,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,0,0,0,0,0,209,21,0,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,103,10,139,0,255,254,255,254,255,254,225,254,192,254,192,254,192,126,128,62,0,30,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,163,$true
239,1,255,255,255,255,2,0,130,3,1,85,21,132,0,248,2,248,2,248,0,240,0,240,0,240,0,240,0,240,0,240,0,240,0,248,0,248,0,252,0,252,0,252,0,254,2,255,6,0,0,0,0,0,0,94,23,110,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,76,26,133,0,129,254,128,62,0,62,0,62,0,62,0,62,0,30,0,30,0,30,0,30,0,30,0,30,0,30,0,30,128,30,192,14,224,14,0,0,0,0,0,0,0,0,0,0,0,0,222,36,$true
239,1,255,255,255,255,2,0,130,3,1,87,28,141,0,255,254,255,254,224,254,224,126,224,126,0,14,0,2,0,2,0,2,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,2,0,0,0,0,0,0,143,231,78,148,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,89,30,139,0,255,254,248,254,240,30,224,14,224,14,192,6,192,2,128,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,26,213,$true
239,1,255,255,255,255,2,0,130,3,1,93,25,136,0,255,254,193,254,0,254,0,62,0,30,0,30,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,0,0,0,0,0,148,210,91,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,95,23,145,0,254,254,252,30,240,2,224,0,224,0,224,0,192,0,192,0,192,0,192,6,192,6,192,6,192,6,192,6,192,6,224,6,224,14,0,0,0,0,0,0,0,0,0,0,0,0,0,30,$false
239,1,255,255,255,255,2,0,130,3,1,85,27,138,0,255,254,240,126,224,30,192,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,30,0,30,0,30,192,62,224,62,0,0,0,0,0,0,85,17,74,101,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,105,19,144,0,192,254,128,126,0,62,0,30,128,30,128,30,128,14,192,14,192,14,192,14,224,14,224,14,240,14,240,14,248,14,254,30,255,30,0,0,0,0,0,0,0,0,0,0,0,0,0,254,$false
239,1,255,255,255,255,2,0,130,3,1,86,37,116,0,255,254,254,14,252,6,248,2,240,0,240,0,224,0,192,0,192,0,128,0,0,0,0,2,0,2,0,2,0,2,0,6,0,6,0,0,0,0,0,0,94,157,90,28,219,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,99,26,130,0,255,254,248,14,240,2,224,0,192,0,192,0,192,0,128,0,192,0,192,0,192,0,192,0,224,0,240,2,248,6,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,213,$true
I have used this code trying to parse the data:
import tensorflow as tf
import os
import array as arr
import numpy as np
import json
os.environ["TF_CPP_MIN_LOG_LEVEL"]="2"
f= open("thumb_and_index.txt","r")
dataset = []
if f.mode == 'r':
contents =f.read()
#list of lines
lines = contents.splitlines()
print("No. of lines : "+str(len(lines)))
for line in lines:
words = line.split(',')
mainlist = []
list = []
flag = 0
for word in words:
print("word : " + word)
if '$' in word:
if flag == 1:
mainlist.append(list)
mainlist.append(word[1:])
dataset.append(mainlist)
else:
mainlist.append(list)
del list[0:len(list)]
list.append(int(word[1:]))
flag = flag + 1
else:
list.append(int(word))
print(json.dumps(dataset, indent = 4))
I want to feed the parsed data into model.fit() using keras in tensorflow(python).
Also I want to ask about the neural network. How many layers and nodes should I keep in my neural network? Suggest a starting point.

there's a plenty ways to do that (formating the data), you can create 2D matrix for the data that has 62 columns for the data and another array that handles the results for this data (X_data,Y_data).
also you can use pandas to create dataframes for the data (same as arrays, bu it's better to show and visualize the data).
example to read the textfile into pandas dataframe
import pandas
df = pandas.read_table('./input/dists.txt', delim_whitespace=True, names=('A', 'B', 'C'))
split the data into x&y then fit it in your model
for the size of the hidden layers in your neural, it's well known that the more layers you add the more accurate results you get (without considering overfitting) , so that depends on your data.
I suggest you to start with a sequential layers as follows (62->2048->1024->512->128->64->sigmoid)

The best approach, especially assuming that dataset is large, is to use the tf.data dataset. There's a CSV reader built right in. The dataset api provides all the functionality you need to preprocess the dataset, it provides built-in multi-core processing, and quite a bit more.
Once you have the dataset built Keras will accept it as an input directly, so fit(my_dataset, inputs=... outputs=...).
The structure of the dataset api takes a little learning, but it's well worth it. Here's the primary guide with lots of examples:
https://www.tensorflow.org/guide/datasets
Scroll down to the section on 'Import CSV data' for poignant examples.
Here's a nice example of using the dataset API with keras: How to Properly Combine TensorFlow's Dataset API and Keras?

How to plot a document topic distribution in structural topic modeling R-package?

If I am using python Sklearn for LDA topic modeling, I can use the transform function to get a "document topic distribution" of the LDA-results like here:
document_topic_distribution = lda_model.transform(document_term_matrix)
Now I tried also the R structural topic models (stm) package and i want get the same. Is there any function in the stm package, which can produce the same thing (document topic distribution)?
I have the stm-object created as follows:
stm_model <- stm(documents = out$documents, vocab = out$vocab,
K = number_of_topics, data = out$meta,
max.em.its = 75, init.type = "Spectral" )
But i didn't find out how I can get the desired distribution out of this object. The documentation didn't really help me aswell.

As emilliman5 pointed out, your stm_model provides access to the underlying parameters of the model, as is shown in the documentation.
Indeed, the theta parameter is a
Number of Documents by Number of Topics matrix of topic proportions.
This requires some linguistical parsing: it is an N_DOCS by N_TOPICS matrix, i.e. it has N_DOCS rows, one per document, and N_TOPICS columns, one per topic. The values are the topic proportions, i.e. if stm_model[1, ] == c(.3, .2, .5), that means Document 1 is 30% Topic 1, 20% Topic 2 and 50% Topic 3.
To find out what topic dominates a document, you have to find the (column!) index of the maximum value, which can be retrieved e.g. by calling apply with MARGIN=1, which basically says "do this row-wise"; which.max simply returns the index of the maximum value:
apply(stm_model$theta, MARGIN=1, FUN=which.max)

TensorFlow, Dataset API and flat_map operation

I'm having difficulties working with tf.contrib.data.Dataset API and wondered if some of you could help. I wanted to transform the entire skip-gram pre-processing of word2vec into this paradigm to play with the API a little bit, it involves the following operations:
Sequence of tokens are loaded dynamically (to avoid loading all dataset in memory at a time), say we then start with a Stream (to be understood as Scala's way, all data is not in memory but loaded when access is needed) of sequence of tokens: seq_tokens.
From any of these seq_tokens we extract skip-grams with a python function that returns a list of tuples (token, context).
Select for features the column of tokens and for label the column of contexts.
In pseudo-code to make it clearer it would look like above. We should be taking advantage of the framework parallelism system not to load by ourselves the data, so I would do something like first load in memory only the indices of sequences, then load sequences (inside a map, hence if not all lines are processed synchronously, data is loaded asynchronously and there's no OOM to fear), and apply a function on those sequences of tokens that would create a varying number of skip-grams that needs to be flattened. In this end, I would formally end up with data being of shape (#lines=number of skip-grams generated, #columns=2).
data = range(1:N)
.map(i => load(i): Seq[String]) // load: Int -> Seq[String] loads dynamically a sequence of tokens (sequences have varying length)
.flat_map(s => skip_gram(s)) // skip_gram: Seq[String] -> Seq[(String, String)] with output length
features = data[0] // features
lables = data[1] // labels
I've tried naively to do so with Dataset's API but I'm stuck, I can do something like:
iterator = (
tf.contrib.data.Dataset.range(N)
.map(lambda i: tf.py_func(load_data, [i], [tf.int32, tf.int32])) // (1)
.flat_map(?) // (2)
.make_one_shot_iterator()
)
(1) TensorFlow's not happy here because sequences loaded have differents lengths...
(2) Haven't managed yet to do the skip-gram part... I actually just want to call a python function that computes a sequence (of variable size) of skip-grams and flatten it so that if the return type is a matrix, then each line should be understood as a new line of the output Dataset.
Thanks a lot if anyone has any idea, and don't hesitate if I forgot to mention useful precisions...

I'm just implementing the same thing; here's how I solved it:
dataset = tf.data.TextLineDataset(filename)
if mode == ModeKeys.TRAIN:
dataset = dataset.shuffle(buffer_size=batch_size * 100)
dataset = dataset.flat_map(lambda line: string_to_skip_gram(line))
dataset = dataset.batch(batch_size)
In my dataset, I treat every line as standalone, so I'm not worrying about contexts that span multiple lines.
I therefore flat map each line through a function string_to_skip_gram that returns a Dataset of a length that depends on the number of tokens in the line.
string_to_skip_gram turns the line into a series of tokens, represented by IDs (using the method tokenize_str) using tf.py_func:
def string_to_skip_gram(line):
def handle_line(line):
token_ids = tokenize_str(line)
(features, labels) = skip_gram(token_ids)
return np.array([features, labels], dtype=np.int64)
res = tf.py_func(handle_line, [line], tf.int64)
features = res[0]
labels = res[1]
return tf.data.Dataset.from_tensor_slices((features, labels))
Finally, skip_gram returns a list of all possible context words and target words:
def skip_gram(token_ids):
skip_window = 1
features = []
labels = []
context_range = [i for i in range(-skip_window, skip_window + 1) if i != 0]
for word_index in range(skip_window, len(token_ids) - skip_window):
for context_word_offset in context_range:
features.append(token_ids[word_index])
labels.append(token_ids[word_index + context_word_offset])
return features, labels
Note that I'm not sampling the context words here; just using all of them.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

scikit-learn multi dimensional features - python

Related

"... has insufficient rank for batching." What is the problem with this 3 line code?

NLP for multi feature data set using TensorFlow

How to create my own dataset for keras model.fit() using Tensorflow(python)?

How to plot a document topic distribution in structural topic modeling R-package?

TensorFlow, Dataset API and flat_map operation

Categories

Resources