num_buckets as a parameter in a tensorflow feature column - python

Currently the TensorFlow documentation defines a categorical vocabulary column this way:
vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(
    key="feature_name_from_input_fn",
    vocabulary_list=["kitchenware", "electronics", "sports"])
However, this assumes that we enter the vocabulary list manually.
For a large dataset with many columns and many unique values, I would like to automate the process this way:
for k in categorical_feature_names:
    vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(
        key=k,
        vocabulary_list=list_of_unique_values_in_the_column)
To do so, I need to retrieve the parameter list_of_unique_values_in_the_column.
Is there any way to do that with TensorFlow?
I know there is tf.unique, which can return the unique values in a tensor, but I don't see how I could feed the column to it so that it returns the right vocabulary list.

If list_of_unique_values_in_the_column is known, you can save the values in a file and read them with tf.feature_column.categorical_column_with_vocabulary_file. If it is unknown, you can use tf.feature_column.categorical_column_with_hash_bucket with a large enough bucket size.
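A minimal sketch of both suggestions, assuming the same column key as in the question; the vocabulary file path is hypothetical:

import tensorflow as tf

# Option 1: vocabulary saved to a text file, one value per line (file path is hypothetical).
vocab_column = tf.feature_column.categorical_column_with_vocabulary_file(
    key="feature_name_from_input_fn",
    vocabulary_file="vocabulary.txt")

# Option 2: hash the raw strings into a fixed number of buckets when the vocabulary
# is unknown; pick a size comfortably larger than the expected number of unique
# values to limit collisions.
hashed_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="feature_name_from_input_fn",
    hash_bucket_size=1000)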

Related

TensorFlow batching: elements with same key in same batch of variable size

I have a pandas dataframe with id keys. For simplicity we can say the id keys are 0-99.
In a second column we have encodings of fixed length K. Each encoding is related to an id key and two or more encodings may be related to the same id key.
Example:
[0, encoding_1]
[0, encoding_2]
[1, encoding_3]
[2, encoding_4]
[2, encoding_5]
I'm able to get batches that contain the rows from each unique key and only those:
ds = ds.group_by_window(
    key_func=lambda elem: tf.cast(elem['id_col'], tf.int64),
    reduce_func=lambda _, window: window.batch(batch_size),
    window_size=batch_size)
But this situation is not ideal because I want the batches to contain multiple unique keys, and not just one (contrastive learning is the goal).
How would I get batches that follow this rule: they must be of some minimum size and if an encoding of id key X is in the batch, then so are all other encodings of id key X.
Any idea on how to approach this?
Thanks!
I think what you are searching for is generators. Keras model.fit() accepts generators as input, so you can pass batches with different batch sizes.
What I would do:
Create a list of same-id encodings from your dataframe (e.g. with a for-loop and pop). This should look something like: [Array(encoding1, encoding2), Array(encoding3)]
Create a generator that yields the next batch from this list. Each batch contains as many entries from the list as you specify in its input.
Optional: Create a dataset with Dataset.from_generator()
This is quite a lot of coding work, so unfortunately I don't have the time to do it, but let me know if you have any specific questions.
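A minimal sketch of that generator idea, assuming a pandas DataFrame df with an 'id_col' column (as in the question's code) and a hypothetical 'encoding' column holding length-K arrays:

import numpy as np

def grouped_batch_generator(df, min_batch_size):
    # Group the encodings by id so that every id's encodings stay in the same batch.
    groups = [np.stack(g['encoding'].to_list()) for _, g in df.groupby('id_col')]
    batch = []
    for group in groups:
        batch.append(group)
        # Emit a batch once the minimum size is reached; ids are never split across batches.
        if sum(len(g) for g in batch) >= min_batch_size:
            yield np.concatenate(batch, axis=0)
            batch = []
    if batch:  # emit any leftover encodings as a final, possibly smaller batch
        yield np.concatenate(batch, axis=0)

The resulting generator can then be passed to model.fit() directly or wrapped with tf.data.Dataset.from_generator().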

Import csv row as array in tensorflow

I have a csv file containing a large number N of columns: the first column contains the label, the other N-1 a numeric representation of my data (Chroma features from a music recording).
My idea is to represent the input data as an array. In practice, I want an equivalent of the standard representation of data in computer vision. Since my data is stored in a csv, I need a csv parser inside the definition of the train input function. I do it this way:
def parse_csv(line):
    columns = tf.decode_csv(line, record_defaults=DEFAULTS)  # take a line at a time
    features = {'songID': columns[0], 'x': columns[1:]}  # create a dictionary out of the features
    labels = features.pop('songID')  # define the label
    return features, labels

def train_input_fn(data_file=fp, batch_size=128):
    """Generate an input function for the Estimator."""
    # Extract lines from input files using the Dataset API.
    dataset = tf.data.TextLineDataset(data_file)
    dataset = dataset.map(parse_csv)
    dataset = dataset.shuffle(1_000_000).repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
However, this returns an error that is not very informative: AttributeError: 'list' object has no attribute 'get_shape'. I know that the culprit is the definition of x inside the features dictionary, but I don't know how to correct it because, fundamentally, I don't really grok the data structures of TensorFlow yet.
As it turns out, features need to be tensors. However, each column is a tensor in itself, and taking columns[1:] results in a list of tensors. To create a higher-dimensional tensor that stores the information from the N-1 columns, one should use tf.stack:
features = {'songID': columns[0], 'x': tf.stack(columns[1:])} # create a dictionary out of the features
tf.stack should solve it.
There is a complete code example available in the following thread.
Tensorflow Python reading 2 files
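For reference, a sketch of the parser with that fix applied, assuming the same DEFAULTS list of record defaults as in the question:

def parse_csv(line):
    columns = tf.decode_csv(line, record_defaults=DEFAULTS)  # one tensor per CSV column
    features = {'songID': columns[0],
                'x': tf.stack(columns[1:])}  # stack the N-1 numeric columns into a single tensor
    labels = features.pop('songID')
    return features, labels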

(To prevent Memory Error) How to one-hot encode a word list to a matrix of INTEGER 8 in Keras using the Tokenizer class

As float64, the default data type of the tokenized matrix, takes more memory, I want it to be int8 instead, thus saving space.
link to documentation
This is the method I'm talking about:
texts_to_matrix(texts)
Arguments:
texts: list of texts to vectorize.
mode: one of "binary", "count", "tfidf", "freq" (default: "binary").
Returns: numpy array of shape (len(texts), num_words).
Taking a look at the source code, the result matrix is created using np.zeros() with no dtype keyword argument, so dtype falls back to numpy's default, float. I think this choice of data type is made to support all forms of transformation, like tfidf, which result in non-integer output.
So I think you have two options:
1. Change the source code
You can add a keyword argument (e.g. dtype) to the definition of texts_to_matrix and change the line where the matrix is created to
x = np.zeros((len(sequences), num_words), dtype=dtype)
2. Use another tool for preprocessing
You can preprocess your text using another tool and then feed it to the Keras network. For example, you can use scikit-learn's CountVectorizer like:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(dtype=np.int8, ...)
matrix = cv.fit_transform(texts).toarray()
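A brief usage sketch with made-up example texts; binary=True roughly mirrors the Tokenizer's default "binary" mode:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["the cat sat", "the dog ran", "the cat ran"]  # toy example texts
cv = CountVectorizer(dtype=np.int8, binary=True)  # binary presence/absence, stored as int8
matrix = cv.fit_transform(texts).toarray()
print(matrix.dtype)   # int8
print(matrix.shape)   # (3, number of distinct words)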

tensorflow - using dynamic shape when defining models

I have a batch of input:
input = tf.placeholder(tf.float32, [NUM_SAMPLE, None, 15])
For each one in the batch, I have a dictionary that describes the relationship of rows. It looks like:
dic = {i:{j:rij,k:rik,...},j:{i:rij,l:rjl,...},...}
Now I want to do this for each sample and its corresponding dic:
updated_sample = sample
for i in range(len(sample)):
    for j in dic[i]:
        tmp = concatenate(sample[j], rij)
        updated_sample[i] += matmul(tmp, W)
in which W is the same for all samples and rows.
However, I cannot use len(sample) in TensorFlow. It seems tf.while_loop may be the answer, but I don't know how to use it for this problem. Any suggestions?
Besides, can I use a dictionary in this way in TensorFlow?
There are two analogs of len(sample) in TensorFlow:
tf.shape(sample)[0]
sample.get_shape().as_list()[0]
The first one, tf.shape(sample), returns an integer tensor of length equal to the rank of the tensor; tf.shape(sample)[0] is then a scalar tensor (shape ()) and should be used within the TensorFlow workflow.
The second one, sample.get_shape(), returns a TensorShape object; sample.get_shape().as_list() transforms this into a list of integers (with None for dimensions unknown at graph-construction time).
In your case, you should use the second of these.
Consider also the option of doing these computations at the numpy level and then feeding them into the graph through placeholders.
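A small sketch contrasting the two, assuming a TF 1.x placeholder similar to the one in the question:

import tensorflow as tf

sample = tf.placeholder(tf.float32, [None, 15])

dynamic_len = tf.shape(sample)[0]             # scalar tensor, value known only at run time
static_len = sample.get_shape().as_list()[0]  # Python value at graph-build time; None here because the dim is unknown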

Removing fraction of a dataset

I am fairly new to Python, and to the numpy and scipy packages in particular.
I am doing regression analysis for a class assignment which involves trying different regression techniques on a data set and seeing which one works. This involves deleting values from the dataset and seeing which algorithm performs well with the reduced data set. Right now I am indexing up to a fraction of the length of the dataset.
Something like:
data = np.loadtxt("filename")
to_be_used = data[0:int(0.6 * len(data))]
Is there any other way I can do this? Say, I want to randomly select 60% of the data instead of the first 60%.
You can grab a random set of data from your array using the numpy.random.choice function:
subset = np.random.choice(data, int(len(data)*0.6), replace=False)
However, if you want to create multiple non-overlapping random sets, you should instead shuffle your array, then use regular slices to get the amount you want in each chunk. For instance, to randomly split your data in half:
np.random.shuffle(data)
one_random_half = data[:len(data)//2]
other_random_half = data[len(data)//2:]
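Another option, sketched here under the same assumptions as the question (data loaded with np.loadtxt), is to permute row indices instead of the array itself; unlike np.random.choice above, this also works when data has more than one column:

import numpy as np

data = np.loadtxt("filename")            # same placeholder path as in the question
idx = np.random.permutation(len(data))   # shuffled row indices
to_be_used = data[idx[:int(0.6 * len(data))]]  # a random 60% of the rows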
