I have two questions. I am trying to plot my data with the bh_sne library, but since the algorithm is inherently randomized, each run gives me a different result. I would like to get the same result on every run, and it seems that random_state is the relevant parameter.
However, I do not know what exactly it means to choose a particular integer for random_state.
For example, what is the difference between random_state=0, random_state=1, random_state=42, ... and random_state=None?
Second, when I passed this parameter to my function with any value other than None, I got the following error:
AttributeError: 'int' object has no attribute 'randint'
I do not have any file named random in my PyCharm project.
This is my code:
data = bh_sne(X, random_state=1)
X contains my feature values.
This lib uses numpy's random module, more specifically: this part.
Just use it like this:
import numpy as np
bh_sne(X, random_state=np.random.RandomState(0))  # seed the RandomState with the integer 0
This can be seen with a simple source search for random in the library, which also turns up some unit tests!
An integer (0 above) is just a source of entropy that determines the state of the internal random-number generator. Without analyzing the PRNG, there are no guarantees about how a seed of 0 behaves compared to 1 or 42. The results do not need to differ (but they often do)!
There is only one guarantee: determinism! Drawing random numbers from a PRNG initialized with seed=my_integer returns the same sequence of numbers every time this is done with that exact seed (the first x numbers are equal each time, for arbitrary x).
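A quick illustration of that determinism with numpy (this is general PRNG behaviour, not specific to bh_sne):
import numpy as np

a = np.random.RandomState(0).randint(0, 100, size=5)
b = np.random.RandomState(0).randint(0, 100, size=5)
print(np.array_equal(a, b))  # True: same seed, same sequence of numbers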
But the intro page probably gives a more important notice (which was my first question when I saw which library you are using while working in Python):
Note: Scikit-learn v0.17 includes TSNE algorithms and you should probably be using them instead of this.
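If you switch, a minimal sketch of the scikit-learn version (note that sklearn's TSNE accepts a plain integer for random_state, unlike bh_sne here):
from sklearn.manifold import TSNE

# X as in your code; random_state=0 makes the embedding reproducible
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)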
I'm currently working on a project researching the properties of some gas mixtures. While testing my code with different inputs, I came upon a bug(?) that I cannot explain. It concerns a computation on a numpy array inside a for loop: the loop yields a different (and wrong) result than the manual construction of the same result, using the exact same code snippets as in the loop but indexing by hand. I have no clue why this happens, or whether it is my own mistake or a bug in numpy.
It is very strange that some instances of the input objects run through the whole for loop without any problem, while others run fine up to a certain index and yet others fail even on the very first iteration.
For instance, one input always stopped at index 16, throwing a:
ValueError: could not broadcast input array from shape (25,) into shape (32,)
Upon further investigation I could confirm that the previous 15 iterations produced correct results, while the results at index 16 were wrong and not even of the correct size. When running iteration 16 manually in the console, no errors occurred.
(Screenshots omitted: the array printed inside the loop at index 16 is shorter than expected, whereas running the same code for index 16 manually in the console gives the expected, full-size result.)
The important part of the code is really only the np.multiply() call in the for loop; I left the rest in for context but am fairly sure it does not interfere with my intentions.
import copy

import cantera as ct
import numpy as np


def thermic_dissociation(input_gas, pressure):
    # Copy of the input_gas object, which may not be altered out of scope
    gas = copy.copy(input_gas)
    # Temperature range
    T = np.logspace(2.473, 4.4, 1000)
    # Matrix containing the data over the whole range of interest
    moles = np.zeros((gas.gas_cantera.n_species, len(T)))
    # Array containing other property of interest
    sum_particles = np.zeros(len(T))
    # The troublesome for-loop:
    for index in range(len(T)):
        print(str(index) + ' start')
        # Set temperature and pressure of the gas
        gas.gas_cantera.TP = T[index], pressure
        # Set gas mixture to a state of chemical equilibrium
        gas.gas_cantera.equilibrate('TP')
        # Sum of particles = molar density * Avogadro constant for every temperature
        sum_particles[index] = gas.gas_cantera.density_mole * ct.avogadro
        # This multiplication is doing the weird stuff; printed to see what is
        # computed before it goes into the result matrix and throws the error
        print(np.multiply(list(gas.gas_cantera.mole_fraction_dict().values()), sum_particles[index]))
        # This is where the error is thrown, as the resulting array is smaller than it should be
        moles[:, index] = np.multiply(list(gas.gas_cantera.mole_fraction_dict().values()), sum_particles[index])
        print(str(index) + ' end')
    # An array helping to handle the results
    molecule_order = list(gas.gas_cantera.mole_fraction_dict().keys())
    return [moles, sum_particles, T, molecule_order]
Any help would be much appreciated!
If you want the array of all species mole fractions, you should use the X property of the cantera.Solution object, which always returns that full array directly. See the documentation for cantera.Solution.X.
The mole_fraction_dict method is specifically meant for cases where you want to refer to the species by name, rather than their order in the Solution object, such as when relating two different Solution objects that define different sets of species.
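For illustration, a minimal sketch of that suggestion inside the loop from the question (assuming the same gas, moles, sum_particles, and index variables); Solution.X returns the full mole-fraction array in species order, so no zero-fraction species are dropped:
# Inside the for loop, instead of building the array from mole_fraction_dict():
moles[:, index] = gas.gas_cantera.X * sum_particles[index]
# And for the names, in the matching order:
molecule_order = gas.gas_cantera.species_names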
This particular issue is not related to numpy. The call to mole_fraction_dict returns a standard python dictionary. The number of elements in the dictionary depends on the optional threshold argument, which has a default value of 0.0.
The source code of Cantera can be inspected to see what happens exactly; see mole_fraction_dict and getMoleFractionsByName.
In other words, a value ends up in the dictionary if x > threshold. Maybe it would make more sense if >= was used here instead of >. And maybe this would have prevented the unexpected outcome in your case.
As confirmed in the comments, you can use mole_fraction_dict(threshold=-np.inf) to get all of the desired values in the dictionary. Or -float('inf') can also be used.
In your code you proceed to call .values() on the dictionary, but this would be problematic if the order of the values is not guaranteed. I'm not sure whether that is the case. It might be better to make the order explicit by retrieving the values from the dict by key, as sketched below.
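A hedged sketch of that idea (assuming the same variables as in the question, and that species_names gives the desired order):
fractions = gas.gas_cantera.mole_fraction_dict(threshold=-np.inf)
ordered = [fractions[name] for name in gas.gas_cantera.species_names]
moles[:, index] = np.multiply(ordered, sum_particles[index])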
I want to generate many randomized realizations of a low-discrepancy sequence with scipy.stats.qmc. I only know this way, which directly provides a randomized sequence:
from scipy.stats import qmc
ld = qmc.Sobol(d=2, scramble=True)
r = ld.random_base2(m=10)
But if I run
r = ld.random_base2(m=10)
more than once, I get:
The balance properties of Sobol' points require n to be a power of 2. 2048 points have been previously generated, then: n=2048+2**10=3072. If you still want to do this, the function 'Sobol.random()' can be used.
It seems that using Sobol.random() is discouraged by the docs.
What I would like (and it should be faster) is to first get
ld = qmc.Sobol(d=2, scramble=False)
and then generate, say, 1000 scramblings (or other randomizations) of this initial sequence.
That would avoid regenerating the Sobol sequence for each sample; only the scrambling would change.
How can I do that?
It seems to me that this is the proper way to do many randomized QMC runs, but I might be wrong and there might be other ways.
As the warning suggests, Sobol' is a sequence, meaning that there is a link with the previous samples. You have to respect the 2^m properties (sample sizes that are powers of two). It is perfectly fine to use Sobol.random() if you understand how to use it; this is why we created Sobol.random_base2(), which warns you if you try to do something that would break the properties of the sequence. Remember that with Sobol' you cannot skip 10 points and then sample 5, or do arbitrary things like that. If you do, you will not get the convergence rate guaranteed by Sobol'.
In your case, what you want to do is to reset the sequence between draws (Sobol.reset). A new draw will be different from the previous one if scramble=True. Another way (using a non-scrambled sequence, for instance) is to sample 2^k and skip the first 2^(k-1) points; then you can sample 2^n with n < k-1.
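If reset-and-redraw does not give you what you need, a hedged alternative sketch (assuming scipy >= 1.7, where qmc.Sobol takes a seed argument) is to build a fresh scrambled engine per realization, each with its own seed:
import numpy as np
from scipy.stats import qmc

realizations = []
for seed in range(1000):
    engine = qmc.Sobol(d=2, scramble=True, seed=seed)  # new scrambling per seed
    realizations.append(engine.random_base2(m=10))     # 2**10 points each
realizations = np.asarray(realizations)                # shape (1000, 1024, 2)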
I was working through chapter 1 of "Hands-On Machine Learning with Scikit-Learn and TensorFlow"
and I came across code using hashlib that splits the dataframe into train and test sets. The code is shown below:
"""
Creating shuffled testset with constant values in training and updated dataset values going to
test set in case dataset is updated, this done via hashlib
"""
import hashlib
import numpy as np
def test_set_check(identifier,test_ratio,hash):
return hash(np.int64(identifier)).digest()[-1]<256*test_ratio
def split_train_test(data,test_ratio,id_column,hash=hashlib.md5):
ids=data[id_column]
in_test_set=ids.apply(lambda id_:test_set_check(id_,test_ratio,hash))
return data.loc[~in_test_set],data.loc[in_test_set]
I want to understand:
Why digest()[-1] gives an integer even though the output of .digest() is a hash (a bytes object)
Why the output is compared to 256 in the expression < 256 * test_ratio
How this is more robust than using np.random.seed(42)
I am a newbie to all this, so it would be great if you could help me figure this out.
The digest() method of a hashlib hash object returns a sequence of bytes. Each byte is a number from 0 to 255. In this particular example the digest is 16 bytes long, and indexing a particular position in it returns the value of that byte as an integer. As the book explains, the author proposes to use the last byte of the hash as an identifier.
The output is compared to 256 * test_ratio because we want to keep a fraction of samples in the test set consistent with test_ratio. Since each byte value is between 0 and 255, comparing it to 256 * test_ratio essentially sets a cap that decides whether or not to keep a sample. For example, if you take the provided housing data and perform the hashing and splitting, you will end up with a test set of around 4,000 samples, which is roughly 20% of the full dataset. If this is hard to see, imagine a list of integers ranging from 0 to 100: keeping only the integers smaller than 100 * 0.2 = 20 ensures we end up with roughly 20% of the samples.
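As a hedged illustration of that check (the identifier value here is just a made-up row id):
import hashlib
import numpy as np

identifier, test_ratio = 12345, 0.2
last_byte = hashlib.md5(np.int64(identifier)).digest()[-1]  # an int in [0, 255]
print(last_byte, last_byte < 256 * test_ratio)              # True -> row goes to the test set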
As the book explains, the entire motivation for going through this rather cumbersome hashing process is to mitigate data snooping bias, which refers to our model (or us) getting a look at the test set. To avoid that, we want to make sure the test set always contains the same samples, even after the dataset is updated with new rows. Simply setting a random seed and then splitting the dataset does not guarantee this once the data changes, but using hash values of a stable identifier does.
I have to combine p-values to get a single p-value.
I'm using the scipy.stats.combine_pvalues function, but it gives a very small combined p-value. Is that normal?
e.g.:
>>> import scipy
>>> p_values_list=[8.017444955844044e-06, 0.1067379119652372, 5.306374345615846e-05, 0.7234201655194492, 0.13050605094545614, 0.0066989543716175, 0.9541246420333787]
>>> test_statistic, combined_p_value = scipy.stats.combine_pvalues(p_values_list, method='fisher',weights=None)
>>> combined_p_value
4.331727536209026e-08
As you can see, the combined_p_value is smaller than any of the p-values in p_values_list.
How can that be?
Thanks in advance,
Burcak
It is correct, because you are testing whether all of your p-values come from a uniform distribution, i.e. that every individual null hypothesis is true. The alternative is that at least one of the individual alternative hypotheses is true, which in your case is very plausible.
We can simulate this by drawing from a uniform distribution 1000 times, each draw having the same length as your list of p-values:
import numpy as np
from scipy.stats import combine_pvalues
from matplotlib import pyplot as plt

# p_values_list as defined in the question
random_p = np.random.uniform(0, 1, (1000, len(p_values_list)))
res = np.array([combine_pvalues(i, method='fisher', weights=None) for i in random_p])
plt.hist(res[:, 0])  # histogram of the simulated Fisher chi-square statistics
plt.show()
From your results, the chi-square statistic is 62.456, which is really large and nowhere near the simulated chi-square values above.
One thing to note is that the combination you did here does not take directionality into account; if that is relevant for your test, you might want to consider using Stouffer's Z along with weights (a small sketch follows below). Another sane way to check is to run a simulation like the one above, generating lists of p-values under the null hypothesis, and see how they differ from what you observed.
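For reference, a minimal sketch of the Stouffer variant (weights are optional and left as None here purely to illustrate the call):
from scipy.stats import combine_pvalues

# p_values_list as defined in the question; weights could e.g. reflect sample sizes
z_stat, p_combined = combine_pvalues(p_values_list, method='stouffer', weights=None)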
Interesting paper but maybe a bit on the statistics side
I am by no means an expert in this field, but I am interested in your question. After some reading of the wiki, it seems to me that the combined_p_value tells you how likely it is that all the p-values in the list were obtained under the same null hypothesis, which is very unlikely given two extremely small values.
Your set has two extremely small values: the 1st and the 3rd. If the reasoning above is correct, removing either of them should yield a much higher combined p-value, which is indeed the case:
remove 1st: p-value of 0.00010569305282803985
remove 3rd: p-value of 2.4713196031837724e-05
In conclusion, I think this is the correct way to interpret the meta-analysis that combine_pvalues actually performs.
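A small sketch reproducing that check (values copied from the question; the printed results should roughly match the numbers quoted above):
from scipy.stats import combine_pvalues

p_values_list = [8.017444955844044e-06, 0.1067379119652372, 5.306374345615846e-05,
                 0.7234201655194492, 0.13050605094545614, 0.0066989543716175,
                 0.9541246420333787]
print(combine_pvalues(p_values_list[1:], method='fisher')[1])                       # drop the 1st value
print(combine_pvalues(p_values_list[:2] + p_values_list[3:], method='fisher')[1])   # drop the 3rd value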
In R, I use the phyper function to perform a hypergeometric test for bioinformatics analysis. However, I use a lot of Python code, and calling R through rpy2 is quite slow here, so I started looking for alternatives. It seems that scipy.stats.hypergeom offers something similar.
Currently, I call phyper like this:
pvalue <- 1-phyper(45, 92, 7518, 1329)
where 45 is the number of selected items having the property of interest, 92 the total number of items having the property, 7518 the number of non-selected items not having the property, and 1329 the total number of selected items.
In R, this yields 6.92113e-13.
Attempting to do the same with scipy.stats.hypergeom, however, yields a completely different result (note that the numbers are swapped because the function accepts them in a different order):
import scipy.stats as stats
pvalue = 1 - stats.hypergeom.cdf(45, 7518, 92, 1329)
print(pvalue)
However, this returns -7.3450134863151106e-12, which makes little sense. Note that I have tested this on other data and had few issues (same precision up to the 4th decimal, which is enough for me).
So it boils down to these possibilities:
I'm using the wrong function for the job (or wrong parameters)
There's a bug in scipy
In case of "1", are there other alternatives to phyper that can be used in Python?
EDIT: As the comments have noted, this is a bug in scipy, fixed in git master.
From the docs, you could try:
hypergeom.sf(x, M, n, N, loc=0): survival function (1 - cdf; sometimes more accurate)
Also, I think you might have the values mixed up.
Models drawing objects from a bin. M is the total number of objects, n is the total number of Type I objects. The RV counts the number of Type I objects in N drawn without replacement from the population.
Therefore, by my reading: x=q, M=n+m, n=m, N=k.
So I would try:
stats.hypergeom.sf(45,(92+7518),92,1329)
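For completeness, a quick hedged check (assuming a scipy version with the bug mentioned in the edit fixed): the sf call with this parameter mapping should reproduce the R value from the question.
from scipy import stats

# 1 - phyper(45, 92, 7518, 1329) in R gave 6.92113e-13
p = stats.hypergeom.sf(45, 92 + 7518, 92, 1329)
print(p)  # expected to be close to 6.92113e-13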