Unexpected behavior when using R sample function with rpy2? - python

I need to cross-validate some R code in Python. My code involves a lot of pseudo-random number generation, so, for easier comparison, I decided to use rpy2 to generate those values in my Python code "from R".
As an example, in R, I have:
set.seed(1234)
runif(4)
[1] 0.1137034 0.6222994 0.6092747 0.6233794
In python, using rpy2, I have:
import rpy2.robjects as robjects
set_seed = robjects.r("set.seed")
runif = robjects.r("runif")
set_seed(1234)
print(runif(4))
[1] 0.1137034 0.6222994 0.6092747 0.6233794
as expected (the values are identical). However, I see strange behavior with the R sample function (the equivalent of numpy.random.choice).
As the simplest reproducible example, I have in R:
set.seed(1234)
sample(5)
[1] 1 3 2 4 5
while in python I have:
sample = robjects.r("sample")
set_seed(1234)
print(sample(5))
[1] 4 5 2 3 1
The results are different. Could anyone explain why this happens and/or provide a way to get the same values in R and Python using the R sample function?

If you print the value of the R function RNGkind() in both situations, I suspect you won't get the same answer. The Python result looks like the default output, while your R result looks like the old buggy output.
For example, in R:
set.seed(1234, sample.kind = "Rejection")
sample(5)
#> [1] 4 5 2 3 1
set.seed(1234, sample.kind = "Rounding")
#> Warning in set.seed(1234, sample.kind = "Rounding"): non-uniform 'Rounding'
#> sampler used
sample(5)
#> [1] 1 3 2 4 5
set.seed(1234, sample.kind = "default")
sample(5)
#> [1] 4 5 2 3 1
Created on 2021-01-15 by the reprex package (v0.3.0)
So it looks to me as though you are still using the old "Rounding" method in your R session. You probably saved a workspace a long time ago, and have reloaded it since. Don't do that, start with a clean workspace each session.
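A minimal sketch of reproducing the old sequence from the rpy2 side (this assumes the rpy2 session from the question; evaluating the set.seed call as an R string sidesteps the dot in sample.kind, which is not a valid Python keyword):
import rpy2.robjects as robjects

sample = robjects.r("sample")
# Force the pre-R-3.6.0 "Rounding" sampler before seeding; R will emit the
# same non-uniform 'Rounding' sampler warning shown above.
robjects.r('set.seed(1234, sample.kind = "Rounding")')
print(sample(5))  # now matches the old R session: [1] 1 3 2 4 5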

Maybe give this a shot (Stack Overflow answer from here). Quoting the answer: "The p argument corresponds to the prob argument in the sample() function":
import numpy as np

a = [1, 2, 3, 4, 5]  # population to draw from (defined here so the call runs)
np.random.choice(a, size=None, replace=True, p=None)
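For instance, a weighted draw without replacement reads much like R's sample(x, size, prob=...); the weights here are made up for illustration:
import numpy as np

np.random.seed(1234)
# Draw 3 of the values 0..4 without replacement, heavily weighting the last one.
print(np.random.choice(5, size=3, replace=False, p=[0.1, 0.1, 0.1, 0.1, 0.6]))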

Related

pandas value_counts() returning incorrect counts

I was wondering if anyone else has ever experienced value_counts() returning incorrect counts. I have two variables, Pass and Fail, and when I use value_counts() it is returning the correct total but the wrong number for each variable.
The data in the data frame is for samples made with different preparation methods (A-G) and then tested on different testing machines (numbered 1-5; they all run the same test, we just have five of them so we can run more tests). I am trying to compare both the methods and the machines by putting the pass % into a pivot table. I would also like to do this for different sample materials, so I have been writing the pass % function in a separate script that I can call from each material's script.
The pass % function is as follows:
def pass_percent(df_copy):
    pds = df_copy.value_counts()
    p = pds['PASS']
    try:
        f = pds['FAIL']
    except:
        f = 0
    print(pds)
    print(p)
    print(f)
    pass_pc = p / (p + f) * 100
    print(pass_pc)
    return pass_pc
And then within the individual material script (e.g. material 1A) I have (among a few other things to tidy up the data frame before this - essentially getting rid of columns I don't need from the testing outputs):
import pandas as pd
from pass_pc_function import pass_percent

mat_1A = pd.pivot_table(df_copy, index='Prep_Method', columns='Test_Machine', aggfunc=pass_percent)
An example of what is happening: for Material 1A I have 100 tests of Prep_Method A on Test_Machine 1, of which 65 passed and 35 failed, so a 65% pass rate. But value_counts() is returning 56 passes and 44 fails (the total is still 100, which is correct, but for some reason it is counting 9 passes as fails). This is just an example; I have much larger data sets than this, but this is essentially what is happening.
I thought perhaps it could be a white space issue so I also have the line:
df_copy.columns = [x.strip() for x in df_copy.columns]
in my M1A script. However, I am still getting this strange behavior.
Any advice would be appreciated. Thanks!
EDIT:
Results example as requested
PASS 31
FAIL 27
Name: Result, dtype: int64
31
27
53.44827586206896
Result
Test_Machine          1          2          3           4
Prep_Method
A             53.448276  89.655172  93.478261   97.916667
B             87.050360  90.833333  91.596639   97.468354
C             83.333333  93.150685  98.305085  100.000000
D             85.207101  94.339623  95.652174   97.163121
E             87.901701  96.310680  95.961538   98.655462
F             73.958333  82.178218  86.166008   93.750000
G             80.000000  91.743119  89.622642   98.529412
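One thing worth checking, extending the whitespace idea above from the column names to the values themselves (a hedged diagnostic on toy data; the column name Result is taken from the output above): value_counts() treats 'PASS', 'PASS ' and 'pass' as three distinct values.
import pandas as pd

# Toy frame with a stray space and mixed case in the Result values.
df_copy = pd.DataFrame({'Result': ['PASS', 'PASS ', 'pass', 'FAIL']})
print(df_copy['Result'].value_counts())  # three different spellings of PASS

# Normalizing the values (not just the column names) merges the counts.
df_copy['Result'] = df_copy['Result'].str.strip().str.upper()
print(df_copy['Result'].value_counts())  # PASS 3, FAIL 1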

tf.random.categorical giving strange results

I am trying to implement np.random.choice in TensorFlow. Here is my implementation:
import numpy as np
import tensorflow as tf

p = tf.Variable(0, tf.int32)
selection_sample = [i for i in range(10)]  # sample to select from
k = tf.convert_to_tensor(selection_sample)
samples = tf.random.categorical(tf.math.log([[1, 0.5, 0.3, 0.6]]), 1)
sample_selected = tf.cast(samples[0][0], tf.int64)
op = tf.assign(p, k[sample_selected])
# selection_sample[samples]
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sample_selected.eval())
    print(k.eval())
    print(sess.run(op))
    print(p.eval())
However, when sample_selected is, for example, 1, I expect p.eval() to be 1, i.e. k[1], but this is not the case. For example, running this code, a sample output is:
3
[0 1 2 3 4 5 6 7 8 9]
1
1
yet p.eval() should be k[3] and sess.run(op) should also be k[3].
What am I doing wrong? Thanks.
When you do:
print(sample_selected.eval())
You get a random value derived from tf.random.categorical. That random value is returned by the session and not saved anywhere else.
Then, when you do:
print((sess.run(op)))
You are assigning the variable p a new random value produced in this call to run. That is the value printed, which is now saved in the variable.
Finally, when you do:
print(p.eval())
You see the value currently stored in p, which is the random value generated in the previous call to run.
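A sketch of one way to get consistent numbers, assuming the graph built in the question: fetch the index and the assignment in a single run() call, so the categorical op is evaluated only once and every fetched value comes from that same draw:
with tf.Session() as sess:
    sess.run(init)
    # One run() call: samples is computed once, so idx and assigned agree.
    idx, assigned = sess.run([sample_selected, op])
    print(idx, assigned)  # assigned == k[idx]
    print(p.eval())       # p now holds that same value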

Sympy coeff not consistent with as_independent

I have the following snippet of code:
import sympy
a = sympy.symbols('a')
b = sympy.symbols('b')
c = sympy.symbols('c')
print((a*b).coeff(c,0))
print((a*b).as_independent(c)[0])
I don't understand why the two print statements print different output. According to the documentation of coeff:
You can select terms independent of x by making n=0; in this case
expr.as_independent(x)[0] is returned (and 0 will be returned instead
of None):
>>> (3 + 2*x + 4*x**2).coeff(x, 0)
3
Is this a bug in sympy, or am I missing something?
It's a bug. I have a pull request fixing it here.
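For reference, the inconsistency reported above does not appear for the docstring's own polynomial example; a quick check, with the expected values taken from the documentation quoted in the question:
import sympy

x = sympy.symbols('x')
expr = 3 + 2*x + 4*x**2
print(expr.coeff(x, 0))           # 3, as documented
print(expr.as_independent(x)[0])  # 3, matching coeff(x, 0)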

Python: "Misbehaving" insert function in a loop

I wrote the following bit of code:
...
for x in range(len(coeff)): coeff[x].insert(0,names[x])
coeff.insert(0,['Center','c1','c2','c3'])
print_matrix(coeff)
...
The print_matrix function just prints a nice matrix from a list of rows [[row1],[row2],[etc...]].
My coeff = [[1,2,3],[4,5,6]] and my names = ['A','B'].
The first time I run the function I get:
coeff = [['Center','c1','c2','c3'],['A',1,2,3],['B',4,5,6]]
+----------------------+
| Center c1 c2 c3 |
| A 1 2 3 |
| B 4 5 6 |
+----------------------+
which is exactly what I want. The problem starts when I run THE SAME (copied and pasted) script just after the first one, to print another list, basis = [[7,8,9],[10,11,12]], in a similar fashion:
...
del x
for x in range(len(basis)): basis[x].insert(0,names[x])
basis.insert(0,['Center','A1','A2','A3'])
print_matrix(basis)
...
I then get:
basis = [['Center','A1','A2','A3'],['A','B',7,8,9],['A','B',10,11,12]]
and an error from the print_matrix function, since it doesn't get a list with equal-length rows. Why?
OK, I worked it out. What happened was that the way basis was constructed in the first place affected the result. I just gave random numbers as an example of basis, but in fact it was (deep in the code):
coordinates = [...,[1,2,3],...]
coordinates[7] = [1,2,3] # Or something like that
basis = []
basis.append(coordinates[7])
...
basis.append(coordinates[7])
so that when I did insert(0, something) on basis[0], it also inserted the element into basis[1]: both entries refer to the same list object.
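A minimal demonstration of that aliasing, with toy values:
row = [1, 2, 3]
basis = [row, row]        # both entries are the same list object
basis[0].insert(0, 'A')
print(basis)              # [['A', 1, 2, 3], ['A', 1, 2, 3]]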
Here is a snippet of code that works:
...
basis_clone = [[y for y in basis[x]] for x in range(len(basis))]
for y, name in zip(basis_clone,orbital_center_names): y.insert(0,name)
basis_clone.insert(0,['Center','A1','A2','A3'])
print_matrix(basis_clone) ; sleep(0.1)
...
None of the methods given here worked, so I had to clone basis the way I did. I'm open to suggestions for a better way to do that, though; one option is sketched below.
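One option (a sketch; copy.deepcopy from the standard library copies nested lists, and names here stands in for orbital_center_names):
import copy

basis = [[7, 8, 9], [7, 8, 9]]      # rows may secretly be the same object
names = ['A', 'B']
basis_clone = copy.deepcopy(basis)  # independent copies of every row
for row, name in zip(basis_clone, names):
    row.insert(0, name)
basis_clone.insert(0, ['Center', 'A1', 'A2', 'A3'])
print(basis_clone)
# [['Center', 'A1', 'A2', 'A3'], ['A', 7, 8, 9], ['B', 7, 8, 9]]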
P.S.: Thank you to @Lattyware for help on good syntax.

What statistics module for python supports one way ANOVA with post hoc tests (Tukey, Scheffe or other)?

I have tried looking through multiple statistics modules for Python but can't seem to find any that support one-way ANOVA post hoc tests.
One-way ANOVA can be used like this:
from scipy import stats
f_value, p_value = stats.f_oneway(data1, data2, data3, data4, ...)
This is a one-way ANOVA, and it returns the F value and the p value.
There is a significant difference if the p value is below your chosen significance level.
The Tukey-Kramer HSD test can be used like this:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
print(pairwise_tukeyhsd(Data, Group))
This performs the multiple comparisons. The output looks like this:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff     lower     upper   reject
---------------------------------------------------
     0      1 -35.2153 -114.8741   44.4434    False
     0      2   46.697  -40.4993  133.8932    False
     0      3  -7.5709    -87.49   72.3482    False
     1      2  81.9123    5.0289  158.7956     True
     1      3  27.6444  -40.8751    96.164    False
     2      3 -54.2679 -131.4209   22.8852    False
---------------------------------------------------
Please refer to this site for how to set the arguments.
The tukeyhsd of statsmodels doesn't return p values.
So, if you want to know the p values, calculate them from the output values, or use R.
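A self-contained sketch combining both steps on made-up data (group sizes, means, and labels are invented for illustration):
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Three groups of 30 observations with different true means.
data1, data2, data3 = (rng.normal(loc, 1.0, 30) for loc in (0.0, 0.5, 1.5))

# One-way ANOVA: is at least one group mean different?
f_value, p_value = stats.f_oneway(data1, data2, data3)
print(f_value, p_value)

# Tukey-Kramer HSD: which pairs of groups differ?
values = np.concatenate([data1, data2, data3])
groups = np.repeat(['g1', 'g2', 'g3'], 30)
print(pairwise_tukeyhsd(values, groups))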
I think that the library Pyvttbl returns an ANOVA table including post-hoc tests (i.e., Tukey HSD). In fact, what is neat about Pyvttbl is that you can also carry out repeated-measures ANOVA.
See the doc for Anova1way here
