It looks like TensorForest, the Random Forest implementation inside TensorFlow, somehow supports categorical features as input (without one-hot encoding).
See
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/tensor_forest/python/ops/data_ops.py#L32
https://github.com/tensorflow/tensorflow/issues/4025#issuecomment-242814047
However it's not clear how to use them.
If you look at this example
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/random_forest_mnist.py
the 'x' parameter at line 65
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/random_forest_mnist.py#L65
must be a float array.
How could I pass categorical features (e.g. strings)?
When using the SKCompat wrapper around the estimator, the 'x' and 'y' parameters do need to be floats, because with that interface you can only pass in one object.
However, with the estimator's input function interface (input_fn=...) that most examples use, the feature dictionary returned by input_fn can be a mix of float, int, and string Tensors. Floats are treated as continuous features; ints and strings are treated as categorical (creating x[i] == T decision nodes instead of x[i] <= T), and no one-hot encoding is needed.
So you need to create an input function that returns batches of data, which is essentially what the SKCompat interface does for you.
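For illustration, here is a minimal sketch of such an input function in old TF 1.x style; the feature names and values are hypothetical:
import tensorflow as tf

def input_fn():
    # Hypothetical features: floats are treated as continuous,
    # ints and strings are treated as categorical by TensorForest.
    features = {
        "age": tf.constant([23.0, 45.0, 31.0]),        # continuous
        "num_visits": tf.constant([3, 10, 1]),         # categorical (int)
        "country": tf.constant(["US", "DE", "FR"]),    # categorical (string)
    }
    labels = tf.constant([0, 1, 0])
    return features, labels

# estimator.fit(input_fn=input_fn, steps=100)  # instead of passing x/y arrays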
I need to convert my dataset (which includes text columns) to RecordIO format. I have tried the code below, but I am unable to fix the error it raises. Do I need to make further changes to my data format?
ValueError: Unsupported dtype object on array
Code:
import io
import sagemaker.amazon.common as smac
X = df[['Subject','Body']].to_numpy()
y = df[['Label']].to_numpy()
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X, y)
buf.seek(0)
Dataset example:
Label      Subject     Body
label a    Test one    Test Body
label b    Test two    Test second
According to documentation in "Common Data Formats for Training",
your content-type is associated with the algorithms in the following table:
ContentType                        Algorithm
application/x-recordio             Object Detection Algorithm
application/x-recordio-protobuf    Factorization Machines, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, Sequence-to-Sequence
Looking at the data conversion guide in the documentation, the data should be passed as arrays of numbers, not strings.
This means that an encoder of some kind is needed (e.g. a LabelEncoder for the labels specifically, while an encoding/embedding algorithm is needed for the rest of the data). Depending on the result you want to achieve, you can choose from a variety of methods such as one-hot encoding, binary encoding, one-of-k encoding, or even more complex word/sentence embedding algorithms.
For example, for a text classification task with a random forest classifier or SVM, it is first necessary to encode the text with a more or less expressive embedding algorithm (e.g. fastText).
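To give a concrete idea, here is a minimal sketch of one possible encoding step before writing the RecordIO buffer; it assumes a TF-IDF representation for the text and a LabelEncoder for the labels, which are just one choice among the methods mentioned above:
import io
import sagemaker.amazon.common as smac
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Combine the text columns and turn them into a numeric matrix (TF-IDF here).
text = df['Subject'] + ' ' + df['Body']
X = TfidfVectorizer(max_features=1000).fit_transform(text).toarray().astype('float32')

# Encode the string labels as numbers.
y = LabelEncoder().fit_transform(df['Label']).astype('float32')

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X, y)
buf.seek(0)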
I haven't used neural networks for many years, so excuse my ignorance.
I was wondering what is the most appropriate way to train a LSTM model based on my dataset.
I have 3 attributes as follows:
Attribute 1: small int e.g., [123, 321, ...]
Attribute 2: text sequence ['cgtaatta', 'ggcctaaat', ... ]
Attribute 3: text sequence ['ttga', 'gattcgtt', ... ]
Class label: binary [0, 1, ...]
The length of each sample's attributes (2 or 3) is arbitrary; therefore I do not want to treat them as words but rather as sequences (that's why I want to use RNN/LSTM models).
Is it possible to have more than one (sequence) input to the LSTM model (are there examples)? Or should I concatenate them into one, e.g. input 1: ["123 cgtaatta ttga", 0]?
You don't need to concatenate the inputs into one; that part is done using the tf.keras.layers.Flatten() layer, which takes multiple inputs and flattens them without affecting the batch size.
Read more here: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten
And here:
https://www.tensorflow.org/tutorials/structured_data/time_series#multi-step_dense
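For illustration, here is a minimal sketch of one way to feed two sequence inputs plus a numeric attribute into a single model, using the Keras functional API with an Embedding + LSTM per sequence and a Concatenate layer; the vocabulary size and layer sizes are made up, and the character sequences are assumed to be integer-encoded and padded:
import tensorflow as tf
from tensorflow.keras import layers, Model

# Two character sequences (e.g. 'cgtaatta', 'ttga' as integer ids) plus one numeric attribute.
seq1_in = layers.Input(shape=(None,), name='seq1')
seq2_in = layers.Input(shape=(None,), name='seq2')
num_in = layers.Input(shape=(1,), name='small_int')

# Embed each character and run a separate LSTM over each sequence.
h1 = layers.LSTM(16)(layers.Embedding(input_dim=6, output_dim=8)(seq1_in))
h2 = layers.LSTM(16)(layers.Embedding(input_dim=6, output_dim=8)(seq2_in))

# Merge both sequence representations with the numeric attribute.
merged = layers.Concatenate()([h1, h2, num_in])
out = layers.Dense(1, activation='sigmoid')(merged)

model = Model(inputs=[seq1_in, seq2_in, num_in], outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy')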
I'm not sure about the most appropriate way, since I wandered here looking for my own answers, but I do know you need to encode the data by giving the text some numerical identities, if that is applicable in your case.
Hope this helps
So logistic regression from Python's sklearn library has the .fit() function, which takes x_train (features) and y_train (labels) as arguments to train the classifier.
It seems that x_train.shape = (number_of_samples, number_of_features)
For x_train I should use the extracted xvector.scp file, which I am reading like so:
b = kaldiio.load_scp('xvector.scp')
And I can print the content like so:
for file_id in b:
    xvector = b[file_id]
    print(xvector)
Right now the b variable is like a dictionary and you can get the x-vector value of the corresponding id. I want to use sklearn Logistic Regression to classify the x-vectors and in order to use the .fit() method I should pass an array as an argument.
My question is how can I make an array that contains only the xvector variables?
PS: there are about 1 million file_ids, and each x-vector has a length of 512, which is too big for an in-memory array.
It seems you are trying to store the dictionary into a numpy array. If the dictionary is small, you can directly store the values as:
import numpy as np
x = np.array(list(b.values()))
However, this will run into OOM issues if the dictionary is large. In this case, you would need to use np.memmap as explained here: https://ipython-books.github.io/48-processing-large-numpy-arrays-with-memory-mapping/
Essentially, you have to add rows to the array one at a time, and flush it when you have run out of memory. The array is stored directly on the disk, so it avoids OOM issues.
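As a rough sketch of the memmap approach applied to this case (the output file name is an assumption, and len(b) is assumed to give the number of x-vectors):
import numpy as np
import kaldiio

b = kaldiio.load_scp('xvector.scp')
n, dim = len(b), 512

# Disk-backed array: rows are written to 'xvectors.dat' instead of kept in RAM.
X = np.memmap('xvectors.dat', dtype='float32', mode='w+', shape=(n, dim))

for i, file_id in enumerate(b):
    X[i] = b[file_id]
    if (i + 1) % 100000 == 0:
        X.flush()          # push the accumulated rows to disk
X.flush()

# X can now be passed to LogisticRegression().fit(X, y) like a regular array.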
In my pandas DataFrame I have lots of boolean features (True/False). Pandas correctly represents them as bool if I check df.dtypes. If I pass my DataFrame to H2O (h2o.H2OFrame(df)), the boolean features are represented as enum, so they are interpreted as categorical features with two categories.
Is there a way to change the type of these features from enum to bool? In pandas I can use df.astype('bool'); is there an equivalent in H2O?
One idea was to encode True/False as their numeric representation (1/0) before converting df to an H2OFrame, but H2O then recognises this as int64.
Thanks in advance for any help!
The enum type is used for categorical variables with two or more categories. So it includes boolean. I.e. there is no distinct bool category in H2O, and there is nothing you need to fix here.
By the way, if you have a lot of boolean features because you have manually done one-hot encoding, don't do that. Instead give H2O the original (multi-level categorical) data, and it will do one-hot encoding when needed, behind the scenes. This is better because algorithms like decision trees can use multi-level categorical data directly, so it will be more efficient.
See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html for some alternatives you can try. The missing category is added for when that column is missing in production.
(But "What happens when you try to predict on a categorical level not seen during training?" at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/deep-learning.html#faq does not seem to describe the behaviour you see?)
Also see http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/use_all_factor_levels.html (I cannot work out from that description if you want it to be true or false, so try both ways!)
UPDATE: set use_all_factor_levels = F and it will only have one input neuron (plus the NA one) for each boolean input, instead of two. If your categorical inputs are almost all boolean, I'd recommend setting this. If your categorical inputs mostly have quite a lot of levels I wouldn't (because, overall, it won't make much difference in the number of input neurons, but it might make the network easier to train).
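For reference, a minimal sketch of passing that flag to a model; the column names here are placeholders:
import h2o
from h2o.estimators import H2ODeepLearningEstimator

h2o.init()
hf = h2o.H2OFrame(df)   # boolean pandas columns arrive as enum, which is fine

# One input neuron per boolean column (plus the missing(NA) one) instead of two.
model = H2ODeepLearningEstimator(use_all_factor_levels=False)
model.train(x=['isBig', 'otherFeature'], y='target', training_frame=hf)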
WHY MISSING(NA)?
If I have a boolean input, e.g. "isBig", there will be 3 input neurons created for it. If you look at varimp() you can see they are named:
isBig.1
isBig.0
isBig.missing(NA)
Imagine you now put it into production, and the user does not give a value (or gives an NA, or gives an illegal value such as "2") for the isBig input. This is when the NA input neuron gets fired, to signify that we don't know if it is big or not.
To be honest, I think this cannot be any more useful than firing both the .0 and the .1 neurons, or firing neither of them. But if you are using use_all_factor_levels=F then it is useful. Otherwise all NA data gets treated as "not-big" rather than "could be big or not-big".
With CNTK I have created a network with 2 input neurons and 1 output neuron.
A line in the training file looks like
|features 1.567518 2.609619 |labels 1.000000
Then the network was trained with BrainScript. Now I want to use the network for predicting values. For example: the input data is [1.82, 3.57]. What is the output from the net?
I have tried Python with the following code, but I am new to it and the code does not work. So my question is: how do I pass the input data [1.82, 3.57] to the eval function?
On stackoverflow there are some hints, here and here, but this is too abstract for me.
Thank you.
import cntk as ct
import numpy as np
z = ct.load_model("LR_reg.dnn", ct.device.cpu())
input_data= np.array([1.82, 3.57], dtype=np.float32)
pred = z.eval({ z.arguments[0] : input_data })
print(pred)
Here's the most defensive way of doing it. CNTK can be forgiving if you omit some of this when the network is specified with V2 constructs. Not sure about a network that was created with V1 code.
Basically you need a pair of braces for each axis. Which axes exist in BrainScript? There's a batch axis, a sequence axis, and then the static axes of your network. You have one-dimensional data, so the following should work:
input_data= np.array([[[1.82, 3.57]]], dtype=np.float32)
This specifies a batch of one sequence, of length one, containing one 1d vector of two elements. You can also try omitting the outermost braces and see if you are getting the same result.
Update: based on more information from the comments, we should not forget that the V1 code also saved the part of the network that computes things like loss and accuracy. If we provide only the features, CNTK will complain that the labels have not been provided. There are two ways to deal with this issue. One possibility is to provide some fake labels, so that the network can evaluate these auxiliary operations. Another possibility is to identify the prediction node and use only that. If the prediction was called 'p' in V1, this Python code
p = z.find_by_name('p')
should create a CNTK function that only needs the features in order to compute the prediction.
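Putting that together, a minimal sketch (assuming the prediction node was indeed named 'p' in the BrainScript definition) could look like this:
import cntk as ct
import numpy as np

z = ct.load_model("LR_reg.dnn", ct.device.cpu())

# Extract the sub-graph that ends at the prediction node 'p',
# so the labels input is not required for evaluation.
p = z.find_by_name('p')

# One batch containing one sequence with one two-element feature vector.
input_data = np.array([[[1.82, 3.57]]], dtype=np.float32)
pred = p.eval({p.arguments[0]: input_data})
print(pred)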