I was doing chapter 1 of "Hands-On Machine Learning with Scikit-Learn and TensorFlow" and I came across code using hashlib which splits train/test data from our dataframe. The code is shown below:
"""
Creating shuffled testset with constant values in training and updated dataset values going to
test set in case dataset is updated, this done via hashlib
"""
import hashlib
import numpy as np
def test_set_check(identifier,test_ratio,hash):
return hash(np.int64(identifier)).digest()[-1]<256*test_ratio
def split_train_test(data,test_ratio,id_column,hash=hashlib.md5):
ids=data[id_column]
in_test_set=ids.apply(lambda id_:test_set_check(id_,test_ratio,hash))
return data.loc[~in_test_set],data.loc[in_test_set]
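I believe it is meant to be called like this on the book's housing DataFrame (reset_index() adds an index column to use as the row id):

housing_with_id = housing.reset_index()  # adds an `index` column
train_set, test_set = split_train_test(housing_with_id, 0.2, "index")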
I want to understand:
1. Why digest()[-1] gives an integer, even though the output of .digest() is a hash code
2. Why the output is compared to the constant 256 in the check < 256 * test_ratio
3. How this is more robust than using np.random.seed(42)
I am a newbie to all this, so it would be great if you could help me figure this out.
The digest() method of a hashlib hash object returns a sequence of bytes. Each byte is a number from 0 to 255. In this particular example, the hash code is 16 bytes long (MD5), and indexing a particular location in it returns the integer value of that byte. As the book explains, the author is proposing to use the last byte of the hash as an identifier.
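For example, a quick check in a Python 3 shell (indexing a bytes object yields ints; np.int64 works as input because it exposes the buffer interface hashlib expects):

import hashlib
import numpy as np

digest = hashlib.md5(np.int64(42)).digest()
print(len(digest))  # 16 -> MD5 produces a 16-byte digest
print(digest[-1])   # an integer between 0 and 255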
The reason the output is compared to 256 * test_ratio is that we want to keep a fraction of samples in our test set consistent with test_ratio. Since each byte value is between 0 and 255, comparing it to 256 * test_ratio essentially sets a "cap" that decides whether or not to keep a sample. For example, if you take the provided housing data and perform the hashing and splitting, you'll end up with a test set of around 4,000 samples, which is roughly 20% of the full dataset. If you're having trouble seeing this, imagine we have a list of integers ranging from 0 to 99. Keeping only the integers smaller than 100 * 0.2 = 20 ensures that we end up with 20% of the samples.
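Since MD5 bytes are approximately uniform over 0-255, the fraction of byte values that pass the check matches test_ratio:

test_ratio = 0.2
kept = [b for b in range(256) if b < 256 * test_ratio]
print(len(kept) / 256)  # 0.203125 -> roughly 20% of all possible byte values pass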
As the book explains, the entire motivation for going through this rather cumbersome hashing process is to mitigate data snooping bias, which refers to our model having access to the test set. To avoid it, we want to make sure that we always have the same samples in our test set. Simply setting a random seed and then splitting the dataset doesn't guarantee this: as soon as the dataset is updated (rows added or reordered), the seeded shuffle produces a different split. Hashing a stable identifier, on the other hand, assigns each sample to the same set forever.
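A minimal sketch of that stability, reusing the check from the question: the hash of a given identifier never changes, so neither does its train/test assignment, no matter how many new rows are appended later:

import hashlib
import numpy as np

def in_test(identifier, test_ratio=0.2):
    return hashlib.md5(np.int64(identifier)).digest()[-1] < 256 * test_ratio

print(in_test(7))  # the same answer for id 7 on every run, on every machine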
I have 2 pink noise signals created with a random generator, and I put that into for-loops like this:

import numpy

b = 0.98                  # filter coefficient (b was undefined in the original snippet)
x = numpy.zeros(1000)     # input noise, renamed from the shadowed builtin `input`
z = numpy.zeros(1000)
for i in range(1000):
    x[i] = numpy.random.uniform(-1, 1)
for i in range(1, 1000):  # start at 1 so z[i-1] does not wrap around to z[-1]
    z[i] = z[i-1] + (1 - b) * (z[i] - x[i-1])  # note: z[i] still holds 0 when read here
Now I am trying to convert this via the snntorch library. I have already used the rate coding part of this library and want to compare it with the latency coding part. So I want to use snntorch.spikegen.latency(), but I don't know how to use it correctly. I have tried changing all the parameters and got no good result.
Do you have any tips for the encoding/decoding part, to convert this noise into a spike train and back?
Thanks to everyone!
Can you share how you're currently trying to use the latency() function?
It should be similar to rate() in that you just pass z to the latency function, though there are many more options involved (e.g., normalize=True finds the time constant that ensures all spike times occur within the range of num_steps).
Each element in z will correspond to one spike. So if it is of dimension N, then the output should be T x N.
The value/intensity of each element corresponds to the time at which that spike occurs. Negative intensities don't make sense here, so either take the absolute value of z before passing it in, or level-shift it.
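For instance, a minimal sketch (num_steps=100 is an arbitrary choice here, and z is the noise array from the question):

import torch
from snntorch import spikegen

z_t = torch.abs(torch.as_tensor(z, dtype=torch.float))  # remove negative intensities

# normalize=True squeezes all spike times into the num_steps window;
# linear=True maps intensity to spike time linearly rather than logarithmically
spikes = spikegen.latency(z_t, num_steps=100, normalize=True, linear=True)
print(spikes.shape)  # torch.Size([100, 1000]) -> T x N, one spike per element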
I want to generate many randomized realizations of a low-discrepancy sequence with scipy.stats.qmc. I only know this way, which directly provides a randomized sequence:
from scipy.stats import qmc
ld = qmc.Sobol(d=2, scramble=True)
r = ld.random_base2(m=10)
But if I run
r = ld.random_base2(m=10)
twice I get
The balance properties of Sobol' points require n to be a power of 2. 2048 points have been previously generated, then: n=2048+2**10=3072. If you still want to do this, the function 'Sobol.random()' can be used.
It seems like using Sobol.random() is discouraged by the docs.
What I would like (and it should be faster) is to first get
ld = qmc.Sobol(d=2, scramble=False)
and then generate, say, 1000 scramblings (or other randomizations) from this initial sequence.
That avoids having to regenerate the Sobol sequence for each sample; only the scrambling is redone.
How do I do that?
It seems to me like this is the proper way to do many randomized QMC runs, but I might be wrong and there might be other ways.
As the warning suggests, Sobol' is a sequence, meaning that each sample is linked to the previous ones. You have to respect the 2^m properties. It's perfectly fine to use Sobol.random() if you understand how to use it; this is why we created Sobol.random_base2(), which prints a warning if you try to do something that would break the properties of the sequence. Remember that with Sobol' you cannot skip 10 points and then sample 5, or do arbitrary things like that. If you do, you will not get the convergence rate guaranteed by Sobol'.
In your case, what you want to do is reset the sequence between draws (Sobol.reset). A new draw will be different from the previous one if scramble=True. Another way (using a non-scrambled sequence, for instance) is to sample 2^k and skip the first 2^(k-1) points; then you can sample 2^n with n < k-1.
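For instance, a hedged sketch of drawing many independent randomized realizations by giving each scrambled engine its own seed (each realization is 2**10 points; re-instantiating per draw plays the same role as resetting):

from scipy.stats import qmc

realizations = []
for seed in range(1000):
    ld = qmc.Sobol(d=2, scramble=True, seed=seed)
    realizations.append(ld.random_base2(m=10))  # 2**10 = 1024 points per realization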
As a learning project I am translating some Haskell code (which I'm unfamiliar with) into Python (which I know well)...
The Haskell library I'm translating has tests which make use of QuickCheck property-based testing. On the Python side I am using Hypothesis as the property-based testing library.
The Haskell tests make use of a helper function which looks like this:
mkIndent' :: String -> Int -> Gen String
mkIndent' val size = concat <$> sequence [indent, sym, trailing]
where
whitespace_char = elements " \t"
trailing = listOf whitespace_char
indent = frequency [(5, vectorOf size whitespace_char), (1, listOf whitespace_char)]
sym = return val
My question is specifically about the frequency generator in this helper.
http://hackage.haskell.org/package/QuickCheck-2.12.6.1/docs/Test-QuickCheck-Gen.html#v:frequency
I understand it to mean that most of the time (with weight 5 out of a total of 6) it will return vectorOf whitespace_char with the expected size, but 1 time in 6 it will return listOf whitespace_char, which could be any length, including zero.
In the context of the library, an indent which does not respect the size would model bad input data for the function under test. So I see the point of occasionally producing such an input.
What I currently don't understand is: why the 5:1 ratio in favour of valid inputs? I would have expected the property-based testing framework to generate various valid and invalid inputs on its own. For now I assume this is a sort of optimisation, so that it doesn't spend most of its time generating invalid examples?
The second part of my question is how to translate this into Hypothesis. AFAICT Hypothesis does not have any equivalent of the frequency generator.
I am wondering whether I should attempt to build a frequency strategy myself from existing Hypothesis strategies, or if the idiom itself is not worth translating and I should just let the framework generate valid & invalid examples alike?
What I have currently is:
from hypothesis import strategies as st

@st.composite
def make_indent_(draw, val, size):
    """
    Indent `val` by `size` using either space or tab.
    Will sometimes randomly ignore `size` param.
    """
    whitespace_char = st.text(' \t', min_size=1, max_size=1)
    trailing = draw(st.text(draw(whitespace_char)))
    indent = draw(st.one_of(
        st.text(draw(whitespace_char), min_size=size, max_size=size),
        st.text(draw(whitespace_char)),
    ))
    return ''.join([indent, val, trailing])
If I generate a few examples in a shell this seems to be doing exactly what I think it should.
But this is my first use of Hypothesis or property-based testing and I am wondering if I am losing something vital by replacing the frequency distribution with a simple one_of?
As far as I can see, you've correctly understood the purpose of using frequency here. It is used to allow the occasional mis-sized indent, instead of either (1) only generating correctly sized indents, which would never test bad indent sizes, or (2) generating randomly sized indents, which would test bad indents over and over again but generate only a fraction of cases with good indents for testing other aspects of the code.
Now, the 5:1 ratio of good to (potentially) bad indent sizes is probably quite arbitrary, and it's hard to know if 1:1 or 10:1 would have been better choices without seeing the details of what's being tested.
Luckily though, with respect to porting this to Hypothesis, the answer to "Have a Strategy that does not uniformly choose between different strategies" includes a deleted comment:
Hypothesis doesn't actually support user-specific probabilities - we start with a uniform distribution, but bias it based on coverage from observed inputs. [...] – Zac Hatfield-Dodds Apr 15 '18 at 3:43
This suggests that the "hypothesis" package automatically adjusts weights when using one_of to increase coverage, meaning that it may automatically up-weight the correctly sized case in your make_indent_ implementation, making it a sort of automatic version of frequency.
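That said, if you do want an explicit frequency-style combinator, here is one possible sketch built from existing strategies (the `frequency` helper below is my own hypothetical construction, not part of Hypothesis; weighting is expressed by repeating entries in the pool):

from hypothesis import strategies as st

def frequency(weighted_strategies):
    # expand [(5, a), (1, b)] into [a, a, a, a, a, b], pick one uniformly,
    # then draw a value from whichever strategy was picked
    pool = [s for weight, s in weighted_strategies for _ in range(weight)]
    return st.sampled_from(pool).flatmap(lambda s: s)

indent = frequency([
    (5, st.text(' \t', min_size=3, max_size=3)),  # correctly sized indent (size=3 here)
    (1, st.text(' \t')),                          # any length, possibly mis-sized
])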
I have two questions. I am trying to plot my data with the bh_sne library; however, since this algorithm is based on random numbers, each run gives me a different result. I would like to get the same result on each run, and it seems that random_state is helpful for that.
But I do not know what exactly it means to choose a different integer for random_state.
For example, what is the difference between random_state=0 and random_state=1 or random_state=42 ... and random_state=None?
Second, when I applied this parameter in my function with any value except None, I got the following error:
AttributeError: 'int' object has no attribute 'randint'
I do not have any files named "random" in my PyCharm project.
This is my code:
data = bh_sne(X, random_state=1)
X contains my feature values.
This lib uses numpy's random module, more specifically this part.
Just use it like that:
import numpy as np
bh_sne(X, random_state=np.random.RandomState(0))  # init the RNG with integer seed 0
This can be seen with a simple source search for "random", which also turns up some unit tests!
An integer (0 above) is just some source of entropy which determines the state of the internal random-number generator. Without analyzing the PRNG there are no guarantees about how a seed of 0 behaves compared to 1 or 42. It does not need to produce different output (but it usually does)!
There is only one guarantee: determinism! Grabbing random numbers from a PRNG initialized with seed=my_integer returns the same sequence of numbers every time it is done with that exact seed (the first x numbers are equal on each run, for arbitrary x).
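A quick demonstration of that guarantee:

import numpy as np

a = np.random.RandomState(0)
b = np.random.RandomState(0)
print(a.randint(0, 100, 5))  # five pseudo-random numbers
print(b.randint(0, 100, 5))  # exactly the same five numbers, on every run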
But the intro page probably gives a more important notice (which was my first question when I saw which lib you were using while working in Python):
Note: Scikit-learn v0.17 includes TSNE algorithms and you should probably be using them instead of this.
I have a data set (~4k samples) of the following structure:
sample type: string - very general
sample sub type: string
sample model number: number - may be None
signature: number array[10]
sampleID: string - unique id
I want to cluster the samples based on the "signature" (I have a function that measures the "distance" between one signature and another), so that when I encounter a new signature I can tell which type/subtype the sample belongs to.
Which algorithm should I use?
P.S. I am using Python and scikit-learn, and I also need to somehow visualize the results.
Since you already have a distance function, and your data set is tiny (~4k samples), just use HAC (hierarchical agglomerative clustering), the grandfather of all clustering algorithms.
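In scikit-learn, HAC is AgglomerativeClustering, and it accepts a precomputed distance matrix, so you can plug in your own distance function. A minimal sketch, where `signatures` and `distance` are hypothetical stand-ins for your data and metric (the parameter is called `affinity` in older scikit-learn versions, `metric` since 1.2):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
signatures = rng.normal(size=(100, 10))  # stand-in for your array[10] signatures
def distance(a, b):                      # stand-in for your distance function
    return np.linalg.norm(a - b)

# precompute the pairwise distance matrix
n = len(signatures)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = distance(signatures[i], signatures[j])

hac = AgglomerativeClustering(n_clusters=5, metric="precomputed", linkage="average")
labels = hac.fit_predict(D)
print(labels[:10])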