I have a csv file containing a large number N of columns: the first column contains the label, the other N-1 a numeric representation of my data (Chroma features from a music recording).
My idea is to represent the input data as an array; in practice, I want an equivalent of the standard representation of data in computer vision. Since my data is stored in a CSV file, I need a CSV parser inside the definition of the training input function. I do it this way:
def parse_csv(line):
    columns = tf.decode_csv(line, record_defaults=DEFAULTS)  # take a line at a time
    features = {'songID': columns[0], 'x': columns[1:]}      # create a dictionary out of the features
    labels = features.pop('songID')                          # define the label
    return features, labels
def train_input_fn(data_file=fp, batch_size=128):
    """Generate an input function for the Estimator."""
    # Extract lines from input files using the Dataset API.
    dataset = tf.data.TextLineDataset(data_file)
    dataset = dataset.map(parse_csv)
    dataset = dataset.shuffle(1_000_000).repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
However, this returns an error that is not very informative: AttributeError: 'list' object has no attribute 'get_shape'. I know that the culprit is the definition of x inside the features dictionary, but I don't know how to correct it because, fundamentally, I don't really grok TensorFlow's data structures yet.
As it turns out, features need to be tensors. However, each column is a tensor in itself, and taking columns[1:] results in a list of tensors. To create a single higher-dimensional tensor that holds the information from the N-1 columns, use tf.stack:
features = {'songID': columns[0], 'x': tf.stack(columns[1:])} # create a dictionary out of the features
tf.stack should solve the problem.
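As a minimal illustration of what tf.stack does here, a short standalone snippet with made-up scalar values standing in for the parsed columns:
import tensorflow as tf

# three scalar column tensors, standing in for the output of tf.decode_csv (values are made up)
c1 = tf.constant(0.1)
c2 = tf.constant(0.2)
c3 = tf.constant(0.3)

x = tf.stack([c1, c2, c3])  # a single rank-1 tensor of shape (3,) instead of a Python list of tensors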
There is a complete code example available in the following thread: Tensorflow Python reading 2 files
I have multiple h5py files (pixel-level annotations) for one image. The image masks are stored in HDF5 files as key-value pairs, with the key being the id of some class. The masks (HDF5 files) all match the dimensions of their corresponding image and represent labels for the pixels in the image. I need to compare all the h5 files with one another and find out the final mask that represents the majority.
But I don't know how to compare multiple h5 files in Python. Can someone kindly help?
What do you mean by "compare"?
If you just want to compare the files to see if they are the same, you can use the h5diff utility from The HDF5 Group. It comes with the HDF5 installer. You can get more info about h5diff here: h5diff utility. Links to all HDF5 utilities are at the top of the page: HDF5 Tools
It sounds like you need to do more than that. Please clarify what you mean by "find out the final mask that represents the majority". Do you want to find the average image size (either mean, median, or mode)? If so, it is relatively straightforward (if you know Python) to open each file and get the dimensions of the image data (the shape of each dataset, which is what you call the values). For reference, the key/value terminology is how h5py refers to HDF5 dataset names and datasets.
Here is a basic outline of the process to open one HDF5 file and loop through its datasets (by key name) to get each dataset's shape (image size). For multiple files, you can add an outer for loop using the glob.iglob iterator to get the HDF5 file names (see the sketch after Method 2 below). For simplicity, I saved the shape values to 3 lists and manually calculated the mean (sum()/len()). If you want to calculate the mask differently, I suggest using NumPy arrays, which have mean and median functions built in. For mode, you need the scipy.stats module (it works on NumPy arrays).
Method 1: iterates on .keys()
import h5py

s0_list = []
s1_list = []
s2_list = []
with h5py.File(filename, 'r') as h5f:   # filename: path to one of the HDF5 files
    for name in h5f.keys():
        shape = h5f[name].shape         # dataset shape = image/mask dimensions
        s0_list.append(shape[0])
        s1_list.append(shape[1])
        s2_list.append(shape[2])

print('Ave len axis=0:', sum(s0_list)/len(s0_list))
print('Ave len axis=1:', sum(s1_list)/len(s1_list))
print('Ave len axis=2:', sum(s2_list)/len(s2_list))
Method 2: iterates on .items()
import h5py

s0_list = []
s1_list = []
s2_list = []
with h5py.File(filename, 'r') as h5f:   # filename: path to one of the HDF5 files
    for name, ds in h5f.items():
        shape = ds.shape                # dataset shape = image/mask dimensions
        s0_list.append(shape[0])
        s1_list.append(shape[1])
        s2_list.append(shape[2])

print('Ave len axis=0:', sum(s0_list)/len(s0_list))
print('Ave len axis=1:', sum(s1_list)/len(s1_list))
print('Ave len axis=2:', sum(s2_list)/len(s2_list))
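As a minimal sketch of the multi-file loop mentioned above (the '*.h5' pattern is only an assumption about where the mask files live; adjust it to your data), using NumPy for the averaging:
import glob
import h5py
import numpy as np

shapes = []                                  # one shape tuple per dataset, across all files
for filename in glob.iglob('*.h5'):          # hypothetical file pattern
    with h5py.File(filename, 'r') as h5f:
        for name, ds in h5f.items():
            shapes.append(ds.shape)

shapes = np.array(shapes)
print('Ave len per axis:', shapes.mean(axis=0))   # mean length along each of the 3 axes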
As float64, the default data type of the tokenized matrix, takes more memory, I want it to be int8 instead, thus saving space.
link to documentation
This is the method I'm talking about:
texts_to_matrix(texts, mode='binary')
Return: numpy array of shape (len(texts), num_words).
Arguments:
texts: list of texts to vectorize.
mode: one of "binary", "count", "tfidf", "freq" (default: "binary").
Taking a look at the source code, the result matrix is created here using np.zeros() with no dtype keyword argument, so the dtype falls back to the default set in the function definition, which is float. I think this data type was chosen to support all forms of transformation, such as tfidf, which produce non-integer output.
So I think you have two options:
1. Change the source code
You can add a keyword argument, e.g. dtype, to the definition of texts_to_matrix and change the line where the matrix is created to
x = np.zeros((len(sequences), num_words), dtype=dtype)
2. Use another tool for preprocessing
You can preprocess your text with another tool and then feed the result to the Keras network. For example, you can use scikit-learn's CountVectorizer like:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(dtype=np.int8, ...)
matrix = cv.fit_transform(texts).toarray()
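Putting option 2 together, a minimal end-to-end sketch (the corpus below is made up; texts stands for your own list of documents):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["first document", "second document", "first one again"]  # hypothetical corpus

cv = CountVectorizer(dtype=np.int8)          # counts stored as int8 instead of the default int64
matrix = cv.fit_transform(texts).toarray()   # dense array of shape (len(texts), vocabulary size)

print(matrix.dtype)   # int8
print(matrix.shape)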
Currently, the TensorFlow documentation defines a categorical vocabulary column this way:
vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(
    key="feature_name_from_input_fn",
    vocabulary_list=["kitchenware", "electronics", "sports"])
However, this assumes that we input the vocabulary list manually.
In the case of a large dataset with many columns and many unique values, I would like to automate the process this way:
for k in categorical_feature_names:
    vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(
        key=k,
        vocabulary_list=list_of_unique_values_in_the_column)
To do so I need to retrieve the parameter list_of_unique_values_in_the_column.
Is there anyway to do that with Tensorflow?
I know there is tf.unique that could return unique values in a tensor but I don't get how I could feed the column to it so it returns the right vocabulary list.
If list_of_unique_values_in_the_column is known, you can save the values to a file and read them with tf.feature_column.categorical_column_with_vocabulary_file. If it is unknown, you can use tf.feature_column.categorical_column_with_hash_bucket with a large enough hash_bucket_size.
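A minimal sketch of both suggestions, assuming the raw data is also available outside the graph as a pandas DataFrame df whose categorical columns are listed in categorical_feature_names (df, the vocabulary file names, and hash_bucket_size=1000 are all illustrative):
import tensorflow as tf
import pandas as pd

# df and categorical_feature_names are assumed to exist; replace with your own data
feature_columns = []
for k in categorical_feature_names:
    vocab = df[k].dropna().unique()              # unique values computed outside the graph
    vocab_path = '%s_vocab.txt' % k              # hypothetical file name, one value per line
    pd.Series(vocab).to_csv(vocab_path, index=False, header=False)
    feature_columns.append(
        tf.feature_column.categorical_column_with_vocabulary_file(
            key=k, vocabulary_file=vocab_path))

# alternative when the vocabulary is unknown or too large to enumerate:
hashed_columns = [
    tf.feature_column.categorical_column_with_hash_bucket(key=k, hash_bucket_size=1000)
    for k in categorical_feature_names]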
I have a batch of input:
input = tf.placeholder(tf.float32, [NUM_SAMPLE, None, 15])
For each sample in the batch, I have a dictionary that describes the relationships between its rows. It looks like:
dic = {i:{j:rij,k:rik,...},j:{i:rij,l:rjl,...},...}
Now I want to do this for each sample and its corresponding dic:
updated_sample = sample
for i in range(len(sample)):
    for j in dic[i]:
        tmp = concatenate(sample[j], rij)
        updated_sample[i] += matmul(tmp, W)
in which W is the same for all samples and rows.
However, I cannot use len(sample) in TensorFlow. It seems tf.while_loop may be the answer, but I don't know how to use it for this problem. Any suggestions?
Besides, can I use a dictionary this way in TensorFlow?
There are two analogs in TensorFlow for len(sample):
tf.shape(sample)[0]
sample.get_shape().as_list()[0]
The first one, tf.shape(sample), returns a 1-D integer tensor whose length equals the rank of sample; tf.shape(sample)[0] is therefore a scalar tensor (shape ()) whose value is only known at run time, so it should be used within the TensorFlow workflow.
The second one, sample.get_shape(), returns a TensorShape object; sample.get_shape().as_list() converts it into a list of Python integers (with None for dimensions that are unknown at graph-construction time).
In your case, you should use the second of these.
Consider also the option of doing these computations at the NumPy level and then feeding them into the graph through placeholders.
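A minimal sketch of the difference between the two, using a placeholder with an unknown first dimension (the shapes and feed values are made up):
import tensorflow as tf

sample = tf.placeholder(tf.float32, [None, 15])   # hypothetical placeholder

static_len = sample.get_shape().as_list()[0]      # Python value; None here, unknown at graph time
dynamic_len = tf.shape(sample)[0]                 # scalar int32 tensor, resolved at run time

with tf.Session() as sess:
    print(static_len)                                                   # None
    print(sess.run(dynamic_len, feed_dict={sample: [[0.0] * 15] * 4}))  # 4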
I am fairly new to Python, and to the NumPy and SciPy packages in particular.
I am doing a regression analysis for a class assignment, which involves trying different regression techniques on a data set and seeing which one works best. Part of this involves deleting values from the dataset and seeing which algorithm performs well with the reduced data set. Right now I am indexing up to a fraction of the length of the dataset.
Something like:
data = np.loadtxt("filename")
to_be_used = data[0:int(0.6 * len(data))]
Is there any other way I can do this? Say, I want to randomly select 60% of the data instead of the first 60%.
You can grab a random subset of your data by sampling row indices with the numpy.random.choice function:
idx = np.random.choice(len(data), int(len(data) * 0.6), replace=False)
subset = data[idx]
However, if you want to create multiple non-overlapping random sets, you should instead shuffle your array, then use regular slices to get the amount you want in each chunk. For instance, to randomly split your data in half:
np.random.shuffle(data)  # shuffles the rows in place
one_random_half = data[:len(data)//2]
other_random_half = data[len(data)//2:]
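Equivalently, a small sketch that leaves the original array untouched by shuffling indices instead; the 60/40 split matches the fraction asked about above, and held_out is just an illustrative name:
import numpy as np

perm = np.random.permutation(len(data))   # a random ordering of the row indices
cut = int(0.6 * len(data))
to_be_used = data[perm[:cut]]             # a random 60% of the rows
held_out = data[perm[cut:]]               # the remaining 40%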