tf.random.categorical giving strange results - python

I am trying to implement np.random.choice in tensorflow. Here is my implementation
import numpy as np
import tensorflow as tf

p = tf.Variable(0, tf.int32)
selection_sample = [i for i in range(10)]  # sample to select from
k = tf.convert_to_tensor(selection_sample)
samples = tf.random.categorical(tf.math.log([[1, 0.5, 0.3, 0.6]]), 1)
sample_selected = tf.cast(samples[0][0], tf.int64)
op = tf.assign(p, k[sample_selected])
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sample_selected.eval())
    print(k.eval())
    print(sess.run(op))
    print(p.eval())
However, when sample_selected is, for example, 1, I expect p.eval() to be 1, i.e. k[1], but this is not the case. For example, running this code gives the sample output
3
[0 1 2 3 4 5 6 7 8 9]
1
1
yet p.eval() should be k[3] and sess.run(op) should also be k[3].
What am I doing wrong? Thanks.

When you do:
print(sample_selected.eval())
You get a random value derived from tf.random.categorical. That random value is returned by the session and not saved anywhere else.
Then, when you do:
print((sess.run(op)))
You are assigning the variable p a new random value produced in this call to run. That is the value printed, which is now saved in the variable.
Finally, when you do:
print(p.eval())
You see the value currently stored in p, which is the random value generated in the previous call to run.
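To get consistent values, one option (a minimal sketch, reusing the graph defined in the question) is to fetch everything in a single call to run, so the categorical draw is evaluated exactly once:
with tf.Session() as sess:
    sess.run(init)
    # A single run() evaluates the random op once, so both fetched values
    # come from the same draw: assigned == k[selected].
    selected, assigned = sess.run([sample_selected, op])
    print(selected, assigned)
    print(p.eval())  # p now holds k[selected] from the run above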

Related

Unexpected behavior when using R sample function with rpy2?

I need to cross-validate R code in Python. My code contains lots of pseudo-random number generation, so, for easier comparison, I decided to use rpy2 to generate those values in my Python code "from R".
As an example, in R, I have:
set.seed(1234)
runif(4)
[1] 0.1137034 0.6222994 0.6092747 0.6233794
In python, using rpy2, I have:
import rpy2.robjects as robjects
set_seed = robjects.r("set.seed")
runif = robjects.r("runif")
set_seed(1234)
print(runif(4))
[1] 0.1137034 0.6222994 0.6092747 0.6233794
as expected (the values match). However, I face strange behavior with the R sample function (the equivalent of the numpy.random.choice function).
As the simplest reproducible example, I have in R:
set.seed(1234)
sample(5)
[1] 1 3 2 4 5
while in python I have:
sample = robjects.r("sample")
set_seed(1234)
print(sample(5))
[1] 4 5 2 3 1
The results are different. Could anyone explain why this happens and/or provide a way to get similar values in R and python using the R sample function?
If you print the value of the R function RNGkind() in both situations, I suspect you won't get the same answer. The Python result looks like the default output, while your R result looks like the old buggy output.
For example, in R:
set.seed(1234, sample.kind = "Rejection")
sample(5)
#> [1] 4 5 2 3 1
set.seed(1234, sample.kind = "Rounding")
#> Warning in set.seed(1234, sample.kind = "Rounding"): non-uniform 'Rounding'
#> sampler used
sample(5)
#> [1] 1 3 2 4 5
set.seed(1234, sample.kind = "default")
sample(5)
#> [1] 4 5 2 3 1
Created on 2021-01-15 by the reprex package (v0.3.0)
So it looks to me as though you are still using the old "Rounding" method in your R session. You probably saved a workspace a long time ago, and have reloaded it since. Don't do that, start with a clean workspace each session.
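If the goal is to reproduce the old results from rpy2, one option (a sketch; evaluating the call as a string of R code sidesteps the dot in the argument name) is to request the legacy sampler explicitly:
import rpy2.robjects as robjects

# Ask R for the legacy "Rounding" sampler before drawing; R will warn that
# this sampler is non-uniform, so use it only to reproduce old results.
robjects.r('set.seed(1234, sample.kind="Rounding")')
sample = robjects.r("sample")
print(sample(5))  # should print: [1] 1 3 2 4 5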
Maybe give this a shot (Stack Overflow answer from here). Quoting the answer: "The p argument corresponds to the prob argument in the sample() function"
import numpy as np
np.random.choice(a, size=None, replace=True, p=None)
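For example, a rough numpy analogue of a weighted R sample() call might look like this (the population and weights below are made up for illustration; p must sum to 1). Note that even with the same seed, numpy and R use different generators, so the draws will not match across languages:
import numpy as np

np.random.seed(1234)
# Draw 3 values from 1..5 without replacement, with per-item probabilities,
# similar in spirit to R's sample(5, 3, prob=...).
draw = np.random.choice([1, 2, 3, 4, 5], size=3, replace=False,
                        p=[0.1, 0.2, 0.3, 0.2, 0.2])
print(draw)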

tf.print() vs Python print vs tensor.eval()

It seems that in TensorFlow there are at least three methods to print out the value of a tensor. I've been reading here and there, yet I am still very confused.
These authors seem to summarize the different usage as:
Python print: can only print out certain attributes of a tensor, e.g. its shape, because outside the computation graph it's merely an operation.
tensor.eval(): not sure how it differs from tf.print().
tf.print(): can output the actual value of a tensor, but must be inserted somewhere in the graph as well as being used by some other operation otherwise it will be dangling and still not printed.
My confusion probably also stems from not being sure how much Python functionality we can access in a TensorFlow computation graph, or where the computation graph "ends" and the Python code "begins". For example:
If I insert a Python print between two lines where I construct a computation graph, when I call sess.run() later, will this line be called?
If I want to print out multiple tensor values in a computation graph, where should I put these statements?
What's the difference between tensor.eval() and tf.print()? How should I use them differently?
The native Python print() statement is executed only once, when the graph is built. Check this out:
a = tf.placeholder(shape=None, dtype=tf.int32)
b = tf.placeholder(shape=None, dtype=tf.int32)
print("a is ", a, " while b is ", b)
c = tf.add(a, b)
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 1, b: 2}))
    print(sess.run(c, feed_dict={a: 3, b: 1}))
By executing this code block, the output is:
# a is Tensor("Placeholder:0", dtype=int32) while b is Tensor("Placeholder_1:0", dtype=int32)
# 3
# 4
On the other hand, let us see tf.print():
a = tf.placeholder(shape=None, dtype=tf.int32)
b = tf.placeholder(shape=None, dtype=tf.int32)
print_op = tf.print("a is ", a, " while b is ", b)
with tf.control_dependencies([print_op]):
    c = tf.add(a, b)
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 1, b: 2}))
    print(sess.run(c, feed_dict={a: 3, b: 1}))
So, according to the output below, if we add a dependency so that the tf.print op must run whenever c is run, we get the output we want:
# a is 1 while b is 2
# 3
# a is 3 while b is 1
# 4
Finally, tensor.eval() is identical to sess.run(tensor). However, the limitation of tensor.eval() is that you can run it to evaluate a single tensor, while the tf.Session can be used to evaluate multiple tensors sess.run([tensor1, tensor2]). If you ask me, I would always use sess.run(list_of_tensors), to evaluate as many tensors as I want to, and print out their values.
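A small sketch of the difference (assuming a default graph, as in the snippets above):
t = tf.constant(42.0)
u = tf.constant(37.0)
with tf.Session() as sess:
    print(t.eval())          # shorthand for sess.run(t) in the default session
    print(sess.run([t, u]))  # a single run() can fetch several tensors at once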
No. The Python print is not called when you call sess.run() later.
If you want to print when you call sess.run() then you can use tf.print.
To print out multiple tensor values in a graph, you should use sess.run() after opening tf.Session(). Sample code is below.
t = tf.constant(42.0)
u = tf.constant(37.0)
pt = tf.print(t)
pu = tf.print(u)
sess = tf.Session()
with sess.as_default():
    sess.run([pt, pu])
42
37
This answer and this in another question will be helpful.
tensor.eval() evaluates a tensor; it is not itself an operator. tf.print() is an operator that prints the given tensor, so once created it becomes one of the graph's nodes and only produces output when it is run (or depended upon) in a tf.Session().

How to access outer scope within Tensorflow Dataset's map transformation?

I am interested to know how we may access state in "outer" scopes within the method passed to the Tensorflow Dataset's map transformation.
For example, if we have the following Python code (not using Tensorflow):
def entry():
    count = 0
    ints = [1, 2, 3, 4, 5]

    def work_with_int(i):
        nonlocal count
        count += 1
        return i + 1

    ds = map(work_with_int, ints)
    for i in ds:
        print(i)

    # count should be 5 now
    print("count is {}".format(count))

entry()
We would expect to see that the value of count is 5 after we iterate through ds. This is because, in each call to work_with_int, we assign to the count variable defined in the scope of entry.
Let's suppose we wish to achieve similar behaviour using Tensorflow Datasets:
import tensorflow as tf
tf.enable_eager_execution()
def entry():
    count = 0
    # 'ints' represents the items in the Tensorflow Dataset
    ints = [1, 2, 3, 4, 5]

    # define a method that operates on each item in the Dataset
    def work_with_int(i):
        nonlocal count
        count += 1
        return i + 1

    # create Dataset and map
    ds = tf.data.Dataset.from_tensor_slices(ints).map(work_with_int)

    # as we iterate through the dataset, the 'work_with_int' method will be called
    for i in ds:
        print(i)

    # count should be 5 now
    print("count is {}".format(count))

entry()
In this case, however, count remains at 1 after iterating through ds.
Is there any reason for this behaviour?
Is there a way to access "outer" scope from within work_with_int in the second example?
In my view, TensorFlow builds a graph first and runs it when sess.run() executes. The function passed to map() is only traced once while that graph is built, so the Python side effects inside it (the nonlocal increment) execute a single time rather than once per element. That is why count stays at 1 instead of accumulating to 5.
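One way around this (a sketch, assuming a TensorFlow version that provides tf.py_function, 1.13+; older versions have tf.py_func in a similar role) is to wrap the mapped function in tf.py_function, which forces the Python body to execute for every element instead of only once at trace time:
import tensorflow as tf
tf.enable_eager_execution()

def entry():
    count = 0
    ints = [1, 2, 3, 4, 5]

    def work_with_int(i):
        nonlocal count
        count += 1
        return i + 1

    # tf.py_function runs the Python body per element, so the nonlocal
    # increment executes five times rather than once during tracing.
    ds = tf.data.Dataset.from_tensor_slices(ints).map(
        lambda i: tf.py_function(work_with_int, [i], tf.int32))

    for i in ds:
        print(i)
    print("count is {}".format(count))  # count is 5

entry()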

Trying to produce different random numbers every time a loop is executed

SRate = [5, 5, 5]
Bots = 120
NQueue = 3
TSim = 100
Exp = 2
DDistance = 1
Lambda = 40  # 120/3 = 40
import random
import numpy as np
Elist = []
AvgSRate = 5

def Initilization(AvgSRate, Lambda, Exp):
    return float((Lambda/(AvgSRate**Exp))*(1+(1/10)*(2*(np.random.seed(-28)) - 1)))

for a in range(1, 361):
    Elist.append(Initilization)
I am trying to produce a set of randomly generated decimals between 0 and 1 for the initialization of a simulation. However, it spits out the same values every time the loop is executed, so when I print Elist I get a list of identical values. Also, the list contains <function Initilization at "the number"> entries; could someone help me eliminate those from the printed list so it only contains numbers?
The issue is that np.random.seed(-28) just seeds the random generator [Documentation]; it does not return any random numbers. To get a random number, use np.random.rand().
Example -
return float((Lambda/(AvgSRate**Exp))*(1+(1/10)*(2*(np.random.rand()) - 1)))
If you want to seed the random generator, call np.random.seed(...) once, before the for loop that calls the Initilization function.
The function after this should look like -
def Initilization(AvgSRate, Lambda, Exp):
    return float((Lambda/(AvgSRate**Exp))*(1+(1/10)*(2*np.random.rand() - 1)))
Also, one more issue is that you are not adding the value returned from the function call to the list: Elist.append(Initilization) just appends the function reference. You need to change that to Elist.append(Initilization(<values for the parameters>)) to call the function and append the returned value to Elist.
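Putting both fixes together, a minimal corrected sketch (keeping the original names; numpy seeds must be non-negative, so -28 is replaced with 28 here) might look like:
import numpy as np

Lambda = 40
AvgSRate = 5
Exp = 2

np.random.seed(28)  # seed once, outside the loop

def Initilization(AvgSRate, Lambda, Exp):
    # np.random.rand() returns a fresh uniform draw from [0, 1) on every call
    return float((Lambda/(AvgSRate**Exp))*(1 + (1/10)*(2*np.random.rand() - 1)))

Elist = [Initilization(AvgSRate, Lambda, Exp) for a in range(1, 361)]
print(Elist[:5])  # 360 distinct floats, no function objects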

scikit's GridSearch and Python in general are not freeing memory

I made some weird observations: my GridSearches keep failing after a couple of hours and I initially couldn't figure out why. I monitored the memory usage over time and saw that it started with a few gigabytes (~6 GB) and kept increasing until it crashed the node when it reached the maximum 128 GB the hardware can take.
I was experimenting with random forests for classification of a large number of text documents. For simplicity -- to figure out what's going on -- I went back to naive Bayes.
The versions I am using are
Python 3.4.2
scikit-learn 0.15.2
I found some related discussion on the scikit-issue list on GitHub about this topic: https://github.com/scikit-learn/scikit-learn/issues/565 and
https://github.com/scikit-learn/scikit-learn/pull/770
And it sounds like it was already successfully addressed!
So, the relevant code that I am using is
grid_search = GridSearchCV(pipeline,
                           parameters,
                           n_jobs=1,
                           cv=5,
                           scoring='roc_auc',
                           verbose=2,
                           pre_dispatch='2*n_jobs',
                           refit=False)  # tried both True and False
grid_search.fit(X_train, y_train)
print('Best score: {0}'.format(grid_search.best_score_))
print('Best parameters set:')
Just out of curiosity, I later decided to do the grid search the quick & dirty way via nested for loop
for p1 in parameterset1:
    for p2 in parameterset2:
        ...
        pipeline = Pipeline([
            ('vec', CountVectorizer(
                binary=True,
                tokenizer=params_dict[i][0][0],
                max_df=params_dict[i][0][1],
                max_features=params_dict[i][0][2],
                stop_words=params_dict[i][0][3],
                ngram_range=params_dict[i][0][4],)),
            ('tfidf', TfidfTransformer(
                norm=params_dict[i][0][5],
                use_idf=params_dict[i][0][6],
                sublinear_tf=params_dict[i][0][7],)),
            ('clf', MultinomialNB())])

        scores = cross_validation.cross_val_score(
            estimator=pipeline,
            X=X_train,
            y=y_train,
            cv=5,
            scoring='roc_auc',
            n_jobs=1)

        params_dict[i][1] = '%s,%0.4f,%0.4f' % (params_dict[i][1], scores.mean(), scores.std())
        sys.stdout.write(params_dict[i][1] + '\n')
So far so good. The grid search runs and writes the results to stdout. However, after some time it exceeds the memory cap of 128 Gb again. Same problem as with the GridSearch in scikit. After some experimentation, I finally found out that
gc.collect()
len(gc.get_objects()) # particularly this part!
in the for loop solves the problem and the memory usage stays constantly at 6.5 Gb over the run time of ~10 hours.
Eventually, I got it to work with the above fix, however, I am curious to hear your ideas about what might be causing this issue and your tips & suggestions!
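For reference, a sketch of where those two calls sit (names as in the nested loop above):
import gc

for p1 in parameterset1:
    for p2 in parameterset2:
        # ... build the pipeline and run cross_val_score as above ...
        gc.collect()           # force a full collection pass
        len(gc.get_objects())  # enumerating tracked objects; this was the part that mattered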
RandomForest in 0.15.2 does not support sparse inputs.
Upgrade sklearn and try again... hopefully this will allow the multiple copies that end up being made to consume far less memory (and speed things up).
I can't see your exact code, but I faced a similar problem recently, so this is worth a try.
A similar memory blow-up can easily happen when we copy values from a mutable array or list-like object to another variable, creating a copy of the original, and then modify the new array or list with append or something similar, growing its size and at the same time growing the original object in the background. This process compounds, so after some time we are out of memory. I was able to avoid this kind of phenomenon by passing a deepcopy() of the original object instead, and maybe you can too.
I had a similar problem: I blew up the memory with a similar process, and then managed to stay at 10% memory load.
UPDATE:
Now I see the snippet of the code with the pandas DataFrame. Such a value-copy issue could easily arise there.
I'm not familiar with GridSearch, but when memory and huge lists are an issue, I'd suggest writing a small custom generator. It can be reused for all your items; just use one that takes any list. If implementing anything beyond the solution below, first read this article, the best generator article I've found. I typed it all in and went piece by piece, and any questions you have after reading it I can try too.
https://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/
You don't need:
for p1 in parameterset1:
Try:
def listerator(this_list):
    i = 0
    while True:
        yield this_list[i]
        i += 1
The yield keyword (anywhere in the function body) makes this a generator, not a regular function. This runs through and says i equals 0; while True I gotta do stuff; they want me to yield this_list[0]; here you go, I'll wait for you at i += 1 if you need me again. The next time it is called, it picks up and does i += 1, notices it's still in a while loop, and gives this_list[1], and records its location (i += 1 again... it will wait there until called again). Notice that as I feed it the list once and make a generator (x here), it will exhaust your list.
In [141]: x = listerator([1,2,3])
In [142]: next(x)
Out[142]: 1
In [143]: next(x)
Out[143]: 2
In [144]: next(x)
Out[144]: 3
In [148]: next(x)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-148-5e4e57af3a97> in <module>()
----> 1 next(x)
<ipython-input-139-ed3d6d61a17c> in listerator(this_list)
2 i = 0
3 while True:
----> 4 yield this_list[i]
5 i += 1
6
IndexError: list index out of range
Let's see if we can use it in a for:
In [221]: for val in listerator([1,2,3,4]):
   .....:     print val
   .....:
1
2
3
4
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-221-fa4f59138165> in <module>()
----> 1 for val in listerator([1,2,3,4]):
2 print val
3
<ipython-input-220-263fba1d810b> in listerator(this_list, seed)
2 i = seed or 0
3 while True:
----> 4 yield this_list[i]
5 i += 1
IndexError: list index out of range
Nope. Let's try to handle that:
def listerator(this_list):
    i = 0
    while True:
        try:
            yield this_list[i]
        except IndexError:
            break
        i += 1
In [223]: for val in listerator([1,2,3,4]):
   .....:     print val
   .....:
1
2
3
4
That works. Now it won't blindly try to return a list element even if it isn't there. From what you said, I can almost guarantee you'll need to be able to seed it (pick up from a certain place, or start freshly from a certain place):
def listerator(this_list, seed=None):
    i = seed or 0
    while True:
        try:
            yield this_list[i]
        except IndexError:
            break
        i += 1
In [150]: l = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
In [151]: x = listerator(l, 8)
In [152]: next(x)
Out[152]: 9
In [153]: next(x)
Out[153]: 10
In [154]: next(x)
Out[154]: 11
i = seed or 0 looks for seed, but since seed defaults to None, it will usually just start at the logical place, 0, the beginning of the list.
How can you use this beast without using (almost) any memory?
parameterset1 = [1,2,3,4]
parameterset2 = ['a','b','c','d']
In [224]: for p1 in listerator(parameterset1):
   .....:     for p2 in listerator(parameterset2):
   .....:         print p1, p2
   .....:
1 a
1 b
1 c
1 d
2 a
2 b
2 c
2 d
3 a
3 b
3 c
3 d
4 a
4 b
4 c
4 d
That looks familiar, huh? Now you can process a trillion values one by one, picking important ones to write to disk, and never blowing up your system. Enjoy!
