Unexpected behavior when using R sample function with rpy2? - python

I need to cross-validate some R code in Python. My code involves a lot of pseudo-random number generation, so, for easier comparison, I decided to use rpy2 to generate those values in my Python code "from R".
As an example, in R, I have:
set.seed(1234)
runif(4)
[1] 0.1137034 0.6222994 0.6092747 0.6233794
In python, using rpy2, I have:
import rpy2.robjects as robjects
set_seed = robjects.r("set.seed")
runif = robjects.r("runif")
set_seed(1234)
print(runif(4))
[1] 0.1137034 0.6222994 0.6092747 0.6233794
as expected (the values are identical). However, I see strange behavior with the R sample function (the equivalent of numpy.random.choice).
As the simplest reproducible example, I have in R:
set.seed(1234)
sample(5)
[1] 1 3 2 4 5
while in python I have:
sample = robjects.r("sample")
set_seed(1234)
print(sample(5))
[1] 4 5 2 3 1
The results are different. Could anyone explain why this happens and/or provide a way to get the same values in R and Python using the R sample function?

If you print the value of the R function RNGkind() in both situations, I suspect you won't get the same answer. The Python result looks like the default output, while your R result looks like the old buggy output.
For example, in R:
set.seed(1234, sample.kind = "Rejection")
sample(5)
#> [1] 4 5 2 3 1
set.seed(1234, sample.kind = "Rounding")
#> Warning in set.seed(1234, sample.kind = "Rounding"): non-uniform 'Rounding'
#> sampler used
sample(5)
#> [1] 1 3 2 4 5
set.seed(1234, sample.kind = "default")
sample(5)
#> [1] 4 5 2 3 1
Created on 2021-01-15 by the reprex package (v0.3.0)
So it looks to me as though you are still using the old "Rounding" method in your R session. You probably saved a workspace a long time ago, and have reloaded it since. Don't do that, start with a clean workspace each session.
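A minimal sketch of reproducing the old sequence from the rpy2 side (this assumes the rpy2 session from the question; evaluating the set.seed call as an R string sidesteps the dot in sample.kind, which is not a valid Python keyword):
import rpy2.robjects as robjects

sample = robjects.r("sample")
# Force the pre-R-3.6.0 "Rounding" sampler before seeding; R will emit the
# same non-uniform 'Rounding' sampler warning shown above.
robjects.r('set.seed(1234, sample.kind = "Rounding")')
print(sample(5))  # now matches the old R session: [1] 1 3 2 4 5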

Maybe give this a shot (Stack Overflow answer from here). Quoting the answer: "The p argument corresponds to the prob argument in the sample() function":
import numpy as np

a = [1, 2, 3, 4, 5]  # population to draw from (defined here so the call runs)
np.random.choice(a, size=None, replace=True, p=None)
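For instance, a weighted draw without replacement reads much like R's sample(x, size, prob=...); the weights here are made up for illustration:
import numpy as np

np.random.seed(1234)
# Draw 3 of the values 0..4 without replacement, heavily weighting the last one.
print(np.random.choice(5, size=3, replace=False, p=[0.1, 0.1, 0.1, 0.1, 0.6]))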

Related

pandas value_counts() returning incorrect counts

I was wondering if anyone else has ever experienced value_counts() returning incorrect counts. I have two variables, Pass and Fail, and when I use value_counts() it is returning the correct total but the wrong number for each variable.
The data in the data frame is for samples made with different preparation methods (A-G) and then tested on different testing machines (numbered 1-5; they all run the same test, we just have five of them so we can run more tests). I am trying to compare both the methods and the machines by putting the pass % into a pivot table. I would also like to do this for different sample materials, so I have been writing the pass % function in a separate script that I can call from each material's script.
The pass % function is as follows:
def pass_percent(df_copy):
    pds = df_copy.value_counts()
    p = pds['PASS']
    try:
        f = pds['FAIL']
    except:
        f = 0
    print(pds)
    print(p)
    print(f)
    pass_pc = p / (p + f) * 100
    print(pass_pc)
    return pass_pc
And then within the individual material script (e.g. material 1A) I have (among a few other things to tidy up the data frame before this - essentially getting rid of columns I don't need from the testing outputs):
import pandas as pd
from pass_pc_function import pass_percent

mat_1A = pd.pivot_table(df_copy, index='Prep_Method', columns='Test_Machine', aggfunc=pass_percent)
An example of what is happening: for Material 1A I have 100 tests of Prep_Method A on Test_Machine 1, of which 65 passed and 35 failed, so a 65% pass rate. But value_counts() is returning 56 passes and 44 fails (the total is still 100, which is correct, but for some reason it is counting 9 passes as fails). This is just an example; I have much larger data sets than this, but this is essentially what is happening.
I thought perhaps it could be a white space issue so I also have the line:
df_copy.columns = [x.strip() for x in df_copy.columns]
in my M1A script. However, I am still getting this strange behavior.
Any advice would be appreciated. Thanks!
EDIT:
Results example as requested
PASS 31
FAIL 27
Name: Result, dtype: int64
31
27
53.44827586206896
Result
Test_Machine          1          2          3           4
Prep_Method
A             53.448276  89.655172  93.478261   97.916667
B             87.050360  90.833333  91.596639   97.468354
C             83.333333  93.150685  98.305085  100.000000
D             85.207101  94.339623  95.652174   97.163121
E             87.901701  96.310680  95.961538   98.655462
F             73.958333  82.178218  86.166008   93.750000
G             80.000000  91.743119  89.622642   98.529412
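One thing worth checking, extending the whitespace idea above from the column names to the values themselves (a hedged diagnostic on toy data; the column name Result is taken from the output above): value_counts() treats 'PASS', 'PASS ' and 'pass' as three distinct values.
import pandas as pd

# Toy frame with a stray space and mixed case in the Result values.
df_copy = pd.DataFrame({'Result': ['PASS', 'PASS ', 'pass', 'FAIL']})
print(df_copy['Result'].value_counts())  # three different spellings of PASS

# Normalizing the values (not just the column names) merges the counts.
df_copy['Result'] = df_copy['Result'].str.strip().str.upper()
print(df_copy['Result'].value_counts())  # PASS 3, FAIL 1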

tf.random.categorical giving strange results

I am trying to implement np.random.choice in TensorFlow. Here is my implementation:
import numpy as np
import tensorflow as tf

p = tf.Variable(0, tf.int32)
selection_sample = [i for i in range(10)]  # sample to select from
k = tf.convert_to_tensor(selection_sample)
samples = tf.random.categorical(tf.math.log([[1, 0.5, 0.3, 0.6]]), 1)
sample_selected = tf.cast(samples[0][0], tf.int64)
op = tf.assign(p, k[sample_selected])
# selection_sample[samples]
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sample_selected.eval())
    print(k.eval())
    print(sess.run(op))
    print(p.eval())
However, when sample_selected is, for example, 1, I expect p.eval() to be 1, i.e. k[1], but this is not the case. For example, running this code, a sample output is:
3
[0 1 2 3 4 5 6 7 8 9]
1
1
yet p.eval() should be k[3] and sess.run(op) should also be k[3].
What am I doing wrong? Thanks.
When you do:
print(sample_selected.eval())
You get a random value derived from tf.random.categorical. That random value is returned by the session and not saved anywhere else.
Then, when you do:
print((sess.run(op)))
You are assigning the variable p a new random value produced in this call to run. That is the value printed, which is now saved in the variable.
Finally, when you do:
print(p.eval())
You see the value currently stored in p, which is the random value generated in the previous call to run.
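A sketch of one way to get consistent numbers, assuming the graph built in the question: fetch the index and the assignment in a single run() call, so the categorical op is evaluated only once and every fetched value comes from that same draw:
with tf.Session() as sess:
    sess.run(init)
    # One run() call: samples is computed once, so idx and assigned agree.
    idx, assigned = sess.run([sample_selected, op])
    print(idx, assigned)  # assigned == k[idx]
    print(p.eval())       # p now holds that same value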

Sympy coeff not consistent with as_independent

I have the following snippet of code:
import sympy
a = sympy.symbols('a')
b = sympy.symbols('b')
c = sympy.symbols('c')
print((a*b).coeff(c,0))
print((a*b).as_independent(c)[0])
I don't understand why the two print statements print different output. According to the documentation of coeff:
You can select terms independent of x by making n=0; in this case
expr.as_independent(x)[0] is returned (and 0 will be returned instead
of None):
>>> (3 + 2*x + 4*x**2).coeff(x, 0)
3
Is this a bug in sympy, or am I missing something?
It's a bug. I have a pull request fixing it here.
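For reference, the inconsistency reported above does not appear for the docstring's own polynomial example; a quick check, with the expected values taken from the documentation quoted in the question:
import sympy

x = sympy.symbols('x')
expr = 3 + 2*x + 4*x**2
print(expr.coeff(x, 0))           # 3, as documented
print(expr.as_independent(x)[0])  # 3, matching coeff(x, 0)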

Python: "Misbehaving" insert function in a loop

I wrote the following bit of code:
...
for x in range(len(coeff)): coeff[x].insert(0,names[x])
coeff.insert(0,['Center','c1','c2','c3'])
print_matrix(coeff)
...
The print_matrix function just prints a nice matrix from a list of rows [[row1],[row2],[etc...]].
My coeff = [[1,2,3],[4,5,6]] and my names = ['A','B'].
The first time I run the function I get:
coeff = [['Center','c1','c2','c3'],['A',1,2,3],['B',4,5,6]]
+----------------------+
| Center c1 c2 c3 |
| A 1 2 3 |
| B 4 5 6 |
+----------------------+
which is exactly what I want. The problem starts when I run THE SAME (copied and pasted) script just after the first one, to print another list, basis = [[7,8,9],[10,11,12]], in a similar fashion:
...
del x
for x in range(len(basis)): basis[x].insert(0,names[x])
basis.insert(0,['Center','A1','A2','A3'])
print_matrix(basis)
...
I then get:
basis = [['Center','A1','A2','A3'],['A','B',7,8,9],['A','B',10,11,12]]
and an error from the print_matrix function, since it doesn't get a list with equal-length rows. Why?
OK, I worked it out. What happened was that the way basis was constructed in the first place affected the result. I just gave random numbers as an example of basis, but in fact it was (deep in the code):
coordinates = [...,[1,2,3],...]
coordinates[7] = [1,2,3] # Or something like that
basis = []
basis.append(coordinates[7])
...
basis.append(coordinates[7])
so that when I did insert(0, something) on basis[0], it also inserted the element into basis[1]: both entries refer to the same list object.
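A minimal demonstration of that aliasing, with toy values:
row = [1, 2, 3]
basis = [row, row]        # both entries are the same list object
basis[0].insert(0, 'A')
print(basis)              # [['A', 1, 2, 3], ['A', 1, 2, 3]]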
Here is a snippet of code that works:
...
basis_clone = [[y for y in basis[x]] for x in range(len(basis))]
for y, name in zip(basis_clone,orbital_center_names): y.insert(0,name)
basis_clone.insert(0,['Center','A1','A2','A3'])
print_matrix(basis_clone) ; sleep(0.1)
...
None of the methods given here worked, so I had to clone basis the way I did. I'm open to suggestions for a better way to do that, though; one option is sketched below.
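One option (a sketch; copy.deepcopy from the standard library copies nested lists, and names here stands in for orbital_center_names):
import copy

basis = [[7, 8, 9], [7, 8, 9]]      # rows may secretly be the same object
names = ['A', 'B']
basis_clone = copy.deepcopy(basis)  # independent copies of every row
for row, name in zip(basis_clone, names):
    row.insert(0, name)
basis_clone.insert(0, ['Center', 'A1', 'A2', 'A3'])
print(basis_clone)
# [['Center', 'A1', 'A2', 'A3'], ['A', 7, 8, 9], ['B', 7, 8, 9]]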
P.S.: Thank you to @Lattyware for help on good syntax.

What statistics module for python supports one way ANOVA with post hoc tests (Tukey, Scheffe or other)?

I have tried looking through multiple statistics modules for Python but can't seem to find any that support one-way ANOVA post hoc tests.
One-way ANOVA can be used like this:
from scipy import stats
f_value, p_value = stats.f_oneway(data1, data2, data3, data4, ...)
This is a one-way ANOVA, and it returns the F value and the p value.
There is a significant difference if the p value is below your chosen significance level.
The Tukey-Kramer HSD test can be used like this:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
print(pairwise_tukeyhsd(Data, Group))
This performs the multiple comparisons. The output looks like this:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff     lower     upper   reject
---------------------------------------------------
     0      1 -35.2153 -114.8741   44.4434    False
     0      2   46.697  -40.4993  133.8932    False
     0      3  -7.5709    -87.49   72.3482    False
     1      2  81.9123    5.0289  158.7956     True
     1      3  27.6444  -40.8751    96.164    False
     2      3 -54.2679 -131.4209   22.8852    False
---------------------------------------------------
Please refer to this site for how to set the arguments.
The tukeyhsd of statsmodels doesn't return p values.
So, if you want to know the p values, calculate them from the output values, or use R.
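A self-contained sketch combining both steps on made-up data (group sizes, means, and labels are invented for illustration):
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Three groups of 30 observations with different true means.
data1, data2, data3 = (rng.normal(loc, 1.0, 30) for loc in (0.0, 0.5, 1.5))

# One-way ANOVA: is at least one group mean different?
f_value, p_value = stats.f_oneway(data1, data2, data3)
print(f_value, p_value)

# Tukey-Kramer HSD: which pairs of groups differ?
values = np.concatenate([data1, data2, data3])
groups = np.repeat(['g1', 'g2', 'g3'], 30)
print(pairwise_tukeyhsd(values, groups))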
I think that the library Pyvttbl returns an ANOVA table including post-hoc tests (i.e., Tukey HSD). In fact, what is neat about Pyvttbl is that you can also carry out repeated-measures ANOVA.
See the doc for Anova1way here
