Randomly sample from multiple tf.data.Datasets in Tensorflow

Suppose I have N tf.data.Datasets and a list of N probabilities (summing to 1). I would like to create a dataset whose examples are sampled from the N datasets with the given probabilities.
This should work for arbitrary probabilities, so a simple zip/concat/flatmap with a fixed number of examples from each dataset is probably not what I am looking for.
Is it possible to do this in TF? Thanks!

As of 1.12, tf.data.experimental.sample_from_datasets provides this functionality:
https://www.tensorflow.org/api_docs/python/tf/data/experimental/sample_from_datasets
EDIT: Looks like in earlier versions this can be accessed by tf.contrib.data.sample_from_datasets
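To make the intended behaviour concrete, here is a minimal plain-Python sketch of weighted sampling from several sources. It is not the TensorFlow implementation; the helper name sample_from_sources is hypothetical, and dropping an exhausted source and continuing with the rest is a choice made for this sketch. In TensorFlow itself the equivalent would be a single call along the lines of tf.data.experimental.sample_from_datasets(datasets, weights).

```python
import itertools
import random


def sample_from_sources(sources, weights, seed=None):
    """Yield items, picking the source of each item at random by weight.

    Hypothetical helper for illustration only. An exhausted source is
    dropped and sampling continues over the remaining ones.
    """
    rng = random.Random(seed)
    iterators = [iter(s) for s in sources]
    weights = list(weights)
    while iterators:
        # choose which source supplies the next element
        idx = rng.choices(range(len(iterators)), weights=weights, k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            del iterators[idx]
            del weights[idx]


# two endless sources, sampled with probabilities 0.2 and 0.8
sources = [itertools.repeat("a"), itertools.repeat("b")]
weights = [0.2, 0.8]
draws = list(itertools.islice(sample_from_sources(sources, weights, seed=0), 10000))
share_b = draws.count("b") / len(draws)
print(round(share_b, 2))  # close to 0.8
```

The weights do not have to sum to 1 in this sketch, because random.choices treats them as relative weights.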

If p is a Tensor of probabilities (or unnormalized relative probabilities) where p[i] is the probability that dataset i is chosen, you can use tf.multinomial in conjunction with tf.contrib.data.choose_from_datasets:
import tensorflow as tf  # TensorFlow 1.x API (tf.contrib, tf.Session)
# create some datasets and their unnormalized probability of being chosen
datasets = [
    tf.data.Dataset.from_tensors(['a']).repeat(),
    tf.data.Dataset.from_tensors(['b']).repeat(),
    tf.data.Dataset.from_tensors(['c']).repeat(),
    tf.data.Dataset.from_tensors(['d']).repeat()]
p = [1., 2., 3., 4.]  # unnormalized
# random choice function
def get_random_choice(p):
    # tf.multinomial expects log-probabilities (logits), hence tf.log
    choice = tf.multinomial(tf.log([p]), 1)
    return tf.cast(tf.squeeze(choice), tf.int64)
# assemble the "choosing" dataset
choice_dataset = tf.data.Dataset.from_tensors([0])  # create a dummy dataset
choice_dataset = choice_dataset.map(lambda x: get_random_choice(p))  # populate it with random choices
choice_dataset = choice_dataset.repeat()  # repeat
# obtain your combined dataset, assembled randomly from source datasets
# with the desired selection frequencies.
combined_dataset = tf.contrib.data.choose_from_datasets(datasets, choice_dataset)
Note that the dataset needs to be initialized (you can't use a simple make_one_shot_iterator):
choice_iterator = combined_dataset.make_initializable_iterator()
choice = choice_iterator.get_next()
with tf.Session() as sess:
    sess.run(choice_iterator.initializer)
    # string tensors come back as bytes under Python 3, hence decode()
    print(''.join([sess.run(choice)[0].decode() for _ in range(20)]))
>> ddbcccdcccbbddadcadb
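The reason unnormalized weights work here is that a multinomial sampler fed log-weights as logits ends up sampling with probabilities p / sum(p). A small NumPy sketch of that arithmetic, using NumPy's own sampler rather than tf.multinomial (so it only illustrates the normalization, under that assumption):

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0, 4.0])  # unnormalized weights, as in the answer

# softmax of log(p) recovers the normalized probabilities
logits = np.log(p)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# draw many dataset indices and compare empirical frequencies with probs
rng = np.random.default_rng(0)
choices = rng.choice(len(p), size=20000, p=probs)
freq = np.bincount(choices, minlength=len(p)) / choices.size

print(probs)  # normalized weights: 0.1, 0.2, 0.3, 0.4
print(freq)   # empirical frequencies, close to probs
```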

I think you can use tf.contrib.data.rejection_resample to achieve target distribution.

Related

Tensorflow Datasets: Is there a way to only modify a certain percentage of labels?

I'm using the following example to analyse the performance of Computer Vision system depending on the data quality.
Keras Implementation Retinanet: https://keras.io/examples/vision/retinanet/
My goal is to corrupt (stretch, shift) a certain percentage (10%, 20%, 30%) of the total bounding boxes across all images. This means that images should be picked at random and then some of their bounding boxes corrupted so that, in total, the target percentage is affected.
I'm using the tensorflow datasets as my training data (e.g. https://www.tensorflow.org/datasets/catalog/kitti).
My basic idea was to generate an array the size of the total amount of boxes, fill it with 1 (modify box) and 0 (ignore box), and then iterate through all boxes:
random_array = np.concatenate((np.ones(int(error_rate_size*TOTAL_NUMBER_OF_BOXES)+1,dtype=int),np.zeros(int((1-error_rate_size)*TOTAL_NUMBER_OF_BOXES)+1,dtype=int)))
The problem is that the implementation I'm using relies heavily on graph execution and specifically on the map function (https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map). I would like to follow this pattern in order to keep the implemented data pipeline.
What I am hoping to do is to use the map function in combination with a global counter, so I can loop through the array and modify a box whenever the condition holds. It should look roughly like this:
COUNT = 0
def damage_data(box):
    scaling_range = 2.0
    global COUNT
    COUNT += 1
    if random_array[COUNT]== 1:
        new_box = tf.stack(
            [
                box[0]*scaling_range*tf.random.uniform(shape=(),minval=0.0,maxval=1.0,dtype=tf.float32,seed=1), # x center
                box[1]*scaling_range*tf.random.uniform(shape=(),minval=0.0,maxval=1.0,dtype=tf.float32,seed=2), # y center
                box[2]*scaling_range*tf.random.uniform(shape=(),minval=0.0,maxval=1.0,dtype=tf.float32,seed=3), # width,
                box[3]*scaling_range*tf.random.uniform(shape=(),minval=0.0,maxval=1.0,dtype=tf.float32,seed=4), # height,
            ],
            axis=-1,)
    else:
        tf.print("Not Changed")
        new_box = tf.stack(
            [
                box[0], # x center
                box[1], # y center
                box[2], # width,
                box[3], # height,
            ],
            axis=-1,)
    return new_box
def damage_data_cross_sequential(image, bbox, class_id):
    # bbox format [x_center, y_center, width, height]
    bbox = tf.map_fn(damage_data,bbox)
    return image, bbox, class_id
train_dataset = train_dataset.map(damage_data_cross_sequential,num_parallel_calls=1)
But using this code the variable COUNT is not incremented globally but rather every map() call starts from the initial value 0. I assume this somehow is caused through the graph implementation and the parallel processes in map().
The question is now if there is any way to globally increase a counter through the map function or if I could extend the given dataset with a unique identifier (e.g. add box[5] = id).
I hope the problem is clear and thanks already! :)
--------------UPDATE 1-------------------------------
The second approach as described by @Lescurel is what I'm trying to do.
Some clarifications about the dataset structure:
The number of boxes per image is not identical. It changes from image to image,
e.g. sample 1: ((x_dim, y_dim, 3), (4,4)), sample 2: ((x_dim, y_dim, 3), (2,4))
For a better understanding the structure can be reproduced with the following:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
valid_ds = tfds.load('kitti', split='validation') # validation is a smaller set
def select_relevant_info(sample):
    image = sample["image"]
    bbox = sample["objects"]["bbox"]
    class_id = tf.cast(sample["objects"]["type"], dtype=tf.int32)
    return image, bbox, class_id
valid_ds = valid_ds.map(select_relevant_info)
for sample in valid_ds.take(1):
    print(sample)

For plenty of reasons, using global state is not a terribly good idea, and it's probably even worse in a concurrent context like this one.
There are at least two other ways of implementing what you want:
1. use a random sample with a threshold as the condition to modify the label
2. put your random array in the dataset as the condition to modify the label.
I personally prefer the first option, which is simpler.
An example.
Let's generate some random data and create a tf.data.Dataset. In this example, the total number of samples is 1000:
imgs = tf.random.uniform((1000, 4, 4))
boxes = tf.ones((1000, 4))
ds = tf.data.Dataset.from_tensor_slices((imgs, boxes))
First option: Random Sample
This function draws a number uniformly between 0 and 1. If this number is higher than the threshold prob, nothing happens. Otherwise, we modify the label. In this example, each label has a 5% chance (prob=0.05) of being modified.
def change_label_with_prob(label, prob=0.05, scaling_range=2.):
    return tf.cond(
        tf.random.uniform(()) > prob,
        lambda: label,
        lambda: label*scaling_range*tf.random.uniform((4,), 0., 1., dtype=tf.float32),
    )
You can simply call it with Dataset.map:
new_ds = ds.map(lambda img, box: (img, change_label_with_prob(box)))
Second Option : Pass the condition array around
First, we generate an array filled with our conditions: True (1) if we want to modify the label, False (0) if not.
# let's set the number to change to 200
N_TO_CHANGE = 200
# randomly shuffled array with 200 True values and 800 False values
cond_array = tf.random.shuffle(
    tf.concat([tf.ones((N_TO_CHANGE,),dtype=tf.bool), tf.zeros((1000 - N_TO_CHANGE,),dtype=tf.bool)], axis=0)
)
Then we can create a dataset from that array of conditions, and zip it with our previous dataset:
# creating a dataset from the conditional array
ds_cond = tf.data.Dataset.from_tensor_slices(cond_array)
# zipping the two datasets together
ds_data_and_cond = tf.data.Dataset.zip((ds, ds_cond))
# each element of that dataset is ((img, box), cond)
We can write our function, roughly the same as before:
def change_label_with_cond(label, cond, scaling_range=2.0):
    # if true, modifies, do nothing otherwise
    return tf.cond(
        cond,
        lambda: label
        * scaling_range
        * tf.random.uniform((4,), 0.0, 1.0, dtype=tf.float32),
        lambda: label,
    )
And then map the function on our new dataset, paying attention to the nested shape of each element of the dataset:
ds_changed_label = ds_data_and_cond.map(
    lambda img_and_box, z: (img_and_box[0], change_label_with_cond(img_and_box[1], z))
)
# New dataset has a shape (img, box), same as before the zipping
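The practical difference between the two options is that a per-element random threshold only hits the target fraction on average, while a shuffled condition array hits it exactly. A NumPy sketch of that difference, outside of tf.data and with illustrative variable names (n_total, fraction and the masks are assumptions for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
n_total = 1000   # number of labels, as in the example above
fraction = 0.2   # target share of labels to modify

# Option 1: independent draw per element, count varies around 200
bernoulli_mask = rng.random(n_total) < fraction

# Option 2: exactly 200 True values, placed at random positions
n_to_change = int(round(fraction * n_total))
exact_mask = np.zeros(n_total, dtype=bool)
exact_mask[:n_to_change] = True
rng.shuffle(exact_mask)

# apply the same kind of random rescaling only where the mask is True
boxes = np.ones((n_total, 4))
noise = 2.0 * rng.random((n_total, 4))
damaged = np.where(exact_mask[:, None], boxes * noise, boxes)

print(bernoulli_mask.sum(), exact_mask.sum())
```

If an exact percentage matters for the experiment, the second option is the safer choice; if an approximate rate is enough, the first is simpler.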

Exponential Moving Average Discrepancy?

I have the following pandas object. It is an OHLC time-series data-frame.
I would like to calculate the EMA30 of the close column. For that, I used 2 different approaches, just as a test.
# Approach A, as explained by sentdex in this video:
# https://youtu.be/t_JXXT7VgeQ?list=PLbLcS9xv6IuGi8uyxMP3-BN-lTRQNpqEG&t=245
def ExpMovingAverage(values, window):
    weights = np.exp(np.linspace(-1., 0., window))
    weights /= weights.sum()
    a = np.convolve(values, weights, mode='full')[:len(values)]
    a[:window] = a[window]
    return a
# Approach B
pd.Series.ewm(local_df['close'].copy(), span=30).mean()
Once calculated, I add them to their respective new columns.
# EMA30 via pandas ewm (approach B above)
local_df['ema30_a'] = pd.Series.ewm(local_df['close'].copy(), span=30).mean()
# EMA30 via ExpMovingAverage (approach A above)
x = local_df['close'].values
calculate_ema30_b = ExpMovingAverage(x, 30)
local_df['ema30_b'] = calculate_ema30_b
The resulting data frame is below:
However, once plotted, the pandas result (blue) deviates from the numpy-based result (red). Which of the two calculation methods is the correct one?
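One way to see why the two curves differ is to compare the weight each method gives to the most recent sample. The sketch below reimplements both in NumPy under two assumptions: that pandas' span maps to alpha = 2 / (span + 1) with the default adjusted weighting, and that the warm-up overwrite in ExpMovingAverage can be ignored for this comparison. The function names are illustrative. Note that np.convolve flips the kernel, so weights[0], the smallest weight, multiplies the newest sample.

```python
import numpy as np


def window_weights(window):
    """Normalized weights used by approach A: exp(-1) ... exp(0)."""
    w = np.exp(np.linspace(-1.0, 0.0, window))
    return w / w.sum()


def convolve_average(values, window):
    """Approach A without the warm-up overwrite (a[:window] = a[window])."""
    return np.convolve(values, window_weights(window), mode="full")[:len(values)]


def ewm_mean(values, span):
    """Adjusted exponentially weighted mean with alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1.0)
    out = np.empty(len(values))
    num = den = 0.0
    for t, x in enumerate(values):
        num = x + (1.0 - alpha) * num    # weighted sum of samples so far
        den = 1.0 + (1.0 - alpha) * den  # sum of the weights so far
        out[t] = num / den
    return out


# Impulse test: every sample is 0 except the most recent one.
values = np.zeros(60)
values[-1] = 1.0

w = window_weights(30)
a_last = convolve_average(values, 30)[-1]  # weight approach A gives the newest sample
b_last = ewm_mean(values, 30)[-1]          # weight the recursive EMA gives the newest sample

print(a_last, w[0], w[-1], b_last)
```

In this sketch the convolution gives the newest sample the smallest weight in its 30-sample window, whereas the recursive form gives it the largest, and the recursive form also has no hard 30-sample cutoff. The two methods therefore compute different averages, and a gap between the plotted curves is expected.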

Is there a way to call a Numpy function inside a TensorFlow session?

I am trying to implement an Expectation Maximization algorithm using TensorFlow and TensorFlow Probability. It worked very well until I tried to handle missing data (the data can contain NaN values in some random dimensions).
The problem is that with missing data I can no longer do all the operations as vector operations; I have to work with indexing and for-loops, like this:
# Here we iterate through all the data samples
for i in range(n):
    # x_i is the sample i
    x_i = tf.expand_dims(x[:, i], 1)
    gamma.append(estimate_gamma(x_i, pi, norm, ber))
    est_x_n_i = []
    est_xx_n_i = []
    est_x_b_i = []
    for j in range(k):
        mu_k = norm.mean()[j, :]
        sigma_k = norm.covariance()[j, :, :]
        rho_k = ber.mean()[j, :]
        est_x_n_i.append(estimate_x_norm(x_i[:d, :], mu_k, sigma_k))
        est_xx_n_i.append(estimate_xx_norm(x_i[:d, :], mu_k, sigma_k))
        est_x_b_i.append(estimate_x_ber(x_i[d:, :], rho_k))
    est_x_n.append(tf.convert_to_tensor(est_x_n_i))
    est_xx_n.append(tf.convert_to_tensor(est_xx_n_i))
    est_x_b.append(tf.convert_to_tensor(est_x_b_i))
What I found was that these operations are not very efficient. While the first samples took less than 1 second each, after 50 samples each one took about 3 seconds. I guess this happened because I was creating new tensors inside the session and that was affecting memory or something similar.
I am quite new to TensorFlow, and since many people only use TensorFlow for deep learning and neural networks, I couldn't find a solution for this.
Then I tried to implement the previous for-loop, and the functions called inside that loop, using only numpy arrays and numpy operations. But this returned the following error:
You must feed a value for placeholder tensor 'Placeholder_4' with
dtype double and shape [8,18]
This error happens because when it tries to execute the numpy functions inside the loop, the placeholder has not been fed yet.
pi_k, mu_k, sigma_k, rho_k, gamma_ik, exp_loglik = exp_max_iter(x, pi, dist_norm, dist_ber)
pi, mu, sigma, rho, responsability, NLL[i + 1] = sess.run([pi_k, mu_k, sigma_k, rho_k, gamma_ik, exp_loglik],{x: samples})
Is there any way to solve this? Thanks.

To answer your title question, "Is there a way to call a Numpy function inside a TensorFlow session?": the sample code below executes a "numpy function" (sklearn.mixture.GaussianMixture) on data with missing values, either by calling the function directly or via TensorFlow's py_function. I suspect this may not be exactly what you are looking for. If you are just trying to implement EM, the existing implementation of the Gaussian Mixture Model in TensorFlow may be of some help:
documentation on tf.contrib.factorization.gmm:
https://www.tensorflow.org/api_docs/python/tf/contrib/factorization/gmm
implementation:
https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/contrib/factorization/python/ops/gmm_ops.py#L462-L506
Sample code to call a 'numpy function' directly and within Tensorflow graph:
import numpy as np
np.set_printoptions(2)
import tensorflow as tf
from sklearn.mixture import GaussianMixture as GMM
def myfunc(x,istf=True):
    #strip nans
    if istf:
        mask = ~tf.is_nan(x)
        x = tf.boolean_mask(x,mask)
    else:
        ind=np.where(~np.isnan(x))
        x = x[ind]
    x = np.expand_dims(x,axis=-1)
    gmm = GMM(n_components=2)
    gmm.fit(x)
    m0,m1 = gmm.means_[:,0]
    return np.array([m0,m1])
# create data with nans
np.random.seed(42)
x = np.random.rand(5,28,1)
c = 5
x.ravel()[np.random.choice(x.size, c, replace=False)] = np.nan
# directly call "numpy function"
for ind in range(x.shape[0]):
    val = myfunc(x[ind,:],istf=False)
    print(val)
[0.7 0.26]
[0.15 0.72]
[0.77 0.2 ]
[0.65 0.23]
[0.35 0.87]
# initialization
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
# create graph
X = tf.placeholder(tf.float32, [28,1])
Y = tf.py_function(myfunc,[X],[tf.float32],name='myfunc')
# call "numpy function" in tensorflow graph
for ind in range(x.shape[0]):
    val = sess.run(Y, feed_dict={X: x[ind,:],})
    print(val)
[array([0.29, 0.76], dtype=float32)]
[array([0.72, 0.15], dtype=float32)]
[array([0.77, 0.2 ], dtype=float32)]
[array([0.23, 0.65], dtype=float32)]
[array([0.35, 0.87], dtype=float32)]

You can wrap your NumPy function as a TensorFlow op with tf.numpy_function, so that it can be called when the graph runs inside a session. A simple example follows: write an IOU function in NumPy and then call it via tf.numpy_function.
import numpy as np
import tensorflow as tf
def IOU(Pred, GT, NumClasses, ClassNames):
    ClassIOU=np.zeros(NumClasses) # vector containing the IOU per class
    ClassWeight=np.zeros(NumClasses) # vector containing the number of pixels in the union (predicted U ground truth) per class
    for i in range(NumClasses): # go over all classes
        Intersection=np.float32(np.sum((Pred==GT)*(GT==i))) # calculate class intersection
        Union=np.sum(GT==i)+np.sum(Pred==i)-Intersection # calculate class union
        if Union>0:
            ClassIOU[i]=Intersection/Union # calculate intersection over union
            ClassWeight[i]=Union
    # b/c we will only take the mean over classes that are actually present in the GT
    present_classes = np.unique(GT)
    mean_IOU = np.mean(ClassIOU[present_classes])
    # append it in final results
    ClassNames = np.append(ClassNames, 'Mean')
    ClassIOU = np.append(ClassIOU, mean_IOU)
    ClassWeight = np.append(ClassWeight, np.sum(ClassWeight))
    return mean_IOU
# and now call it as
NumClasses=6
ClassNames=['Background', 'Class_1', 'Class_1',
            'Class_1 ', 'Class_1', 'Class_1 ']
x = tf.numpy_function(IOU, [y_pred, y_true, NumClasses, ClassNames],
                      tf.float64, name=None)
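Before wrapping a NumPy function like this in a TensorFlow op, it can help to check it on an input small enough to verify by hand. The sketch below uses a trimmed-down version of the same per-class computation; mean_iou is an illustrative name, and the 2x2 arrays are made-up data.

```python
import numpy as np


def mean_iou(pred, gt, num_classes):
    """Mean intersection over union across the classes present in gt."""
    class_iou = np.zeros(num_classes)
    for i in range(num_classes):
        intersection = np.sum((pred == gt) & (gt == i))
        union = np.sum(gt == i) + np.sum(pred == i) - intersection
        if union > 0:
            class_iou[i] = intersection / union
    present_classes = np.unique(gt)
    return class_iou[present_classes].mean()


pred = np.array([[0, 1],
                 [1, 1]])
gt = np.array([[0, 1],
               [0, 1]])

# class 0: intersection 1, union 2 -> 0.5
# class 1: intersection 2, union 3 -> 2/3
result = mean_iou(pred, gt, num_classes=2)
print(result)  # (0.5 + 2/3) / 2
```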

PCA analysis considering N-less relevant components

I am trying to learn the basics of PCA analysis in Python using scikit libraries (in particular sklearn.decomposition and sklearn.preprocessing). The goal is to import data from images into a matrix X (each row is a sample, each column is a feature), then standardize X, use PCA to extract principal components (2 most important, 6 most important....6 less important), project X on these principal components, reverse the previous transformation and plot the result in order to see the difference with respect to the original image/images.
Now let's say that I do not want to consider the 2,3,4... most important principal components but I want to consider the N less relevant components, let's say N=6.
How should the analysis be done?
I mean I can't simply standardize then call PCA().fit_transform and then revert back with inverse_transform() to plot the results.
At the moment I am doing something like this:
X_std = StandardScaler().fit_transform(X) # standardize original data
pca = PCA()
model = pca.fit(X_std) # create model with all components
Xprime = model.components_[range(dim-6, dim, 1),:] # get last 6 PC
And then I stop, because I know I should call transform() but I do not understand how to do it. I tried several times without success.
Can someone tell me whether the previous steps are correct and point me in the right direction?
Thank you very much
EDIT: I have now adapted this solution, as suggested by the first answer to my question:
model = PCA().fit(X_std)
model2pc = model
model2pc.components_[range(2, img_count, 1), :] = 0
Xp_2pc = model2pc.transform(X_std)
Xr_2pc = model2pc.inverse_transform(Xp_2pc)
And then I do the same for 6pc, 60pc, last 6 pc. What I have noticed is that this is very time consuming. I would like to get a model directly extracting the principal components I need (without zeroing out the others) and then perform transform() and inverse_transform() on that with that model.

If you want to ignore all but the last 6 principal components, you can just zero out the ones you don't want to keep.
N = 6
X_std = StandardScaler().fit_transform(X)
pca = PCA()
model = pca.fit(X_std) # create model with all components
model.components_[:-N] = 0
Then, to remove all but the last N components from the data, just do a forward and inverse transform of the data:
Xprime = model.inverse_transform(model.transform(X_std))
Here is an example:
>>> X = np.random.rand(18).reshape(6, 3)
>>> model = PCA().fit(X)
A round-trip transform should give back the original data:
>>> X
array([[0.16594796, 0.02366958, 0.8403745 ],
[0.25219425, 0.22879029, 0.07950927],
[0.69636084, 0.4410933 , 0.97431828],
[0.50121079, 0.44835563, 0.95236146],
[0.6793044 , 0.53847562, 0.27882302],
[0.32886931, 0.0643043 , 0.10597973]])
>>> model.inverse_transform(model.transform(X))
array([[0.16594796, 0.02366958, 0.8403745 ],
[0.25219425, 0.22879029, 0.07950927],
[0.69636084, 0.4410933 , 0.97431828],
[0.50121079, 0.44835563, 0.95236146],
[0.6793044 , 0.53847562, 0.27882302],
[0.32886931, 0.0643043 , 0.10597973]])
Now zero out the first principal component:
>>> model.components_
array([[ 0.22969899, 0.21209762, 0.94986998],
[-0.67830467, -0.66500728, 0.31251894],
[ 0.69795497, -0.71608653, -0.0088847 ]])
>>> model.components_[:-2] = 0
>>> model.components_
array([[ 0. , 0. , 0. ],
[-0.67830467, -0.66500728, 0.31251894],
[ 0.69795497, -0.71608653, -0.0088847 ]])
The round-trip transform now gives a different result since we've removed the first principal component (which contains the greatest amount of variance):
>>> model.inverse_transform(model.transform(X))
array([[ 0.12742811, -0.01189858, 0.68108405],
[ 0.36513945, 0.33308073, 0.54656949],
[ 0.58029482, 0.33392119, 0.49435263],
[ 0.39987803, 0.35478779, 0.53332196],
[ 0.71114004, 0.56787176, 0.41047233],
[ 0.44000711, 0.16692583, 0.56556581]])
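If zeroing components on a fitted model feels indirect, the same projection can be written out with plain linear algebra. The following NumPy sketch assumes unwhitened PCA, where the components are the right singular vectors of the centered data ordered by explained variance; reconstruct_from_components is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np


def reconstruct_from_components(X, component_idx):
    """Project X onto the selected principal components and map back."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # rows of Vt are the principal directions, largest variance first
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[component_idx]       # shape (k, n_features)
    scores = Xc @ V.T           # like transform(), restricted to k components
    return scores @ V + mean    # like inverse_transform()


rng = np.random.default_rng(0)
X = rng.random((6, 3))

full = reconstruct_from_components(X, slice(None))       # all components
last2 = reconstruct_from_components(X, slice(-2, None))  # 2 least important
first1 = reconstruct_from_components(X, slice(0, 1))     # most important only

print(np.allclose(full, X))
```

Because the SVD is computed once, reconstructions for several component subsets (first 2, first 6, last 6, and so on) can reuse it instead of refitting. Separately, in the question's edit, model2pc = model only creates a second name for the same fitted object, so zeroing model2pc.components_ also changes model; refitting or copying the model (for example with copy.deepcopy) avoids that.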

Select 5 data points closest to SVM hyperplane

I have written Python code using Sklearn to cluster my dataset:
af = AffinityPropagation().fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_= len(cluster_centers_indices)
I am exploring the use of query-by-clustering and so form an initial training dataset by:
td_title =[]
td_abstract = []
td_y= []
for each in centers:
    td_title.append(title[each])
    td_abstract.append(abstract[each])
    td_y.append(y[each])
I then train my model (an SVM) on it by:
clf = svm.SVC()
clf.fit(X, data_y)
I wish to write a function that, given the centres, the model, the X values and the Y values, will append the 5 data points which the model is most unsure about, i.e. the data points closest to the hyperplane. How can I do this?

The first steps of your process aren't entirely clear to me, but here's a suggestion for "Select(ing) 5 data points closest to SVM hyperplane". The scikit documentation defines decision_function as the distance of the samples to the separating hyperplane. The method returns an array which can be sorted with argsort to find the "top/bottom N samples".
Following this basic scikit example, define a function closestN to return the samples closest to the hyperplane.
import numpy as np
def closestN(X_array, n):
    # array of sample distances to the hyperplane
    dists = clf.decision_function(X_array)
    # absolute distance to hyperplane
    absdists = np.abs(dists)
    return absdists.argsort()[:n]
Add these two lines to the scikit example to see the function implemented:
closest_samples = closestN(X, 5)
plt.scatter(X[closest_samples][:, 0], X[closest_samples][:, 1], color='yellow')
[Figures: the original scikit example plot, and the same plot with the closest samples highlighted]
If you need to append the samples to some list, you could somelist.append(closestN(X, 5)). If you needed the sample values you could do something like somelist.append(X[closestN(X, 5)]).
closestN(X, 5)
array([ 1, 20, 14, 31, 24])
X[closestN(X, 5)]
array([[-1.02126202, 0.2408932 ],
[ 0.95144703, 0.57998206],
[-0.46722079, -0.53064123],
[ 1.18685372, 0.2737174 ],
[ 0.38610215, 1.78725972]])
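The selection step itself does not depend on scikit-learn: it only ranks samples by the absolute value of their decision score. A minimal sketch on made-up scores, assuming a binary classifier so that decision_function returns one value per sample (closest_n and the score values are illustrative):

```python
import numpy as np


def closest_n(decision_values, n):
    """Indices of the n samples whose decision score is closest to zero."""
    decision_values = np.asarray(decision_values)
    return np.argsort(np.abs(decision_values))[:n]


# hypothetical output of clf.decision_function(X) for seven samples
scores = np.array([2.5, -0.1, 0.7, -3.0, 0.05, 1.2, -0.4])
idx = closest_n(scores, 3)
print(idx)  # indices of the three most uncertain samples
```

Passing the scores in, rather than reading a global clf inside the function, keeps the helper reusable across models.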
