Passing random variables to sklearn random search (RandomizedSearchCV)

What is the use of reciprocal() and expon() in the below code?
svm_grid_R = {'kernel':["linear","rbf"], 'C': reciprocal(20,200000), "gamma" : expon(scale=1.0)}
Why can't we just use range()? What ranges do expon(scale=1.0) and reciprocal(20,200000) signify?
For context, the code that uses these parameters is given below:
svm_reg = SVR()
rnd_search = RandomizedSearchCV(svm_reg, param_distributions=svm_grid_R,
n_iter=50, cv=5, scoring='neg_mean_squared_error',
verbose=2, random_state=42)
rnd_search.fit(housing_prepared, housing_labels)

I suggest you check the part of your script where the functions are imported in order to figure out what they are. From your question, I infer the following:
reciprocal most likely comes from scipy.stats (from scipy.stats import reciprocal), which gives you a reciprocal (log-uniform) random variable.
expon most likely comes from scipy.stats (from scipy.stats import expon), which gives you an exponential random variable.
In your code, you are passing these random variables as the C and gamma parameters to the random search. This means that the random parameters used by the search will be sampled from these two distributions.
Technically, you could also use range to tell the search to randomly sample the numbers from a given sequence. Another way to do this is pass the search a random variable from which to sample random parameters. Your code is taking the second approach.
To better understand what the second approach is all about, try the following:
# Import the distribution
from scipy.stats import expon
# Initialize a random variable with lambda=1 (scale=1)
exponential_rv = expon(scale=1)
# Draw a random sample from this distribution
exponential_rv.rvs()
> 0.780028923390962
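In this specific case, your search would be passing gamma=0.780028923390962 to your support vector machine, since expon is the distribution assigned to gamma; C would be drawn from the reciprocal distribution in the same way.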
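To make the "what range" part of the question concrete, here is a minimal sketch that samples from both distributions outside of the search. The sample size and the random_state value are arbitrary choices for illustration, not something taken from the original code.

```python
from scipy.stats import expon, reciprocal

# Frozen distributions, as in the question's parameter dictionary
c_dist = reciprocal(20, 200000)   # log-uniform on [20, 200000]
gamma_dist = expon(scale=1.0)     # exponential on [0, inf) with mean 1

# random_state is fixed here only to make the sketch repeatable
c_samples = c_dist.rvs(size=1000, random_state=42)
gamma_samples = gamma_dist.rvs(size=1000, random_state=42)

# Every C draw stays inside the bounds passed to reciprocal()
print(c_samples.min(), c_samples.max())

# gamma draws are non-negative and average out near the scale
print(gamma_samples.min(), gamma_samples.mean())

# 2000 is halfway between 20 and 200000 on a log scale,
# so half of the probability mass lies below it
print(c_dist.cdf(2000))
```

Compared with range(), which can only list evenly spaced integers, the log-uniform distribution spreads draws evenly across orders of magnitude, which is usually what you want for a parameter like C.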

Related

Proper Method to Seed Numpy and Multiple Scipy Distributions Concurrently

Currently, I am programming a large simulation that uses random variates from multiple distributions that come from both numpy and scipy.stats, and where the distributions should also be independent. Seeking a way to ensure reproducibility, I luckily stumbled upon Abhinav's response here, where they provide an amazing example. Nevertheless, it notably only seeds a single distribution from scipy, whereas my code has multiple scipy distributions. Is there a way to seed all scipy distributions at once (while still seeding the numpy distributions)? If not all at once, is it possible to seed all of the continuous distributions? (It just seems inefficient to seed every single distribution separately). Thank you very much!
Edit: A minimal reproducible example can be found below (it is similar to Abhinav's example):
from numpy.random import Generator, PCG64
from scipy.stats import binom, norm
n, p, size, seed = 10, 0.5, 10, 12345
numpy_randomGen = Generator(PCG64(seed))
scipy_randomGen = binom
scipy_randomGen2 = norm
# this is the part I want to simplify, as I have many distributions from scipy
# maybe there is a convention that simplifies it?
scipy_randomGen.random_state=numpy_randomGen
scipy_randomGen2.random_state=numpy_randomGen
print(scipy_randomGen.rvs(n, p, size=size))
print(scipy_randomGen2.rvs(size=size))
print(numpy_randomGen.binomial(n, p, size))

I'm not sure exactly what you're after, but you still need to seed each distribution separately. So there's unlikely to be anything simpler than providing a random_state argument to each rvs call, as in
.rvs(n, p, size=size, random_state=...)
Here the random_state argument can be a Generator or an integer seed (however, in the latter case it constructs an old-style RandomState object and seeds it under the hood).
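As a sketch of that suggestion applied to the question's example, a single Generator can be handed to every rvs call and also used directly for the NumPy draws. The helper name draw_all is hypothetical and exists only to show that the whole sequence is reproducible from one seed.

```python
from numpy.random import Generator, PCG64
from scipy.stats import binom, norm


def draw_all(seed, n=10, p=0.5, size=10):
    """Draw from two SciPy distributions and NumPy using one shared Generator."""
    rng = Generator(PCG64(seed))
    # The same generator is passed to each rvs call, so no global state is touched
    binom_draws = binom.rvs(n, p, size=size, random_state=rng)
    norm_draws = norm.rvs(size=size, random_state=rng)
    # NumPy draws come from the very same generator
    numpy_draws = rng.binomial(n, p, size)
    return binom_draws, norm_draws, numpy_draws


first = draw_all(12345)
second = draw_all(12345)
print(first[0])
print(first[1])
print(first[2])
```

Because all three draws consume the same stream in a fixed order, rerunning with the same seed reproduces every array, and nothing is assigned to the module-level binom or norm objects.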

SARIMAX simulation of possible paths

I am trying to create a simulation of possible paths of a stochastic process, which is not anchored to any particular point. E.g. fit SARIMAX model to weather temperature data and then use the model to make a simulation of the temperature.
Here I use the standard demonstration from statsmodels page as a simpler example:
import numpy as np
import pandas as pd
from scipy.stats import norm
import statsmodels.api as sm
import matplotlib.pyplot as plt
from datetime import datetime
import requests
from io import BytesIO
Fitting the model:
wpi1 = requests.get('https://www.stata-press.com/data/r12/wpi1.dta').content
data = pd.read_stata(BytesIO(wpi1))
data.index = data.t
# Set the frequency
data.index.freq="QS-OCT"
# Fit the model
mod = sm.tsa.statespace.SARIMAX(data['wpi'], trend='c', order=(1,1,1))
res = mod.fit(disp=False)
print(res.summary())
Creating simulation:
res.simulate(len(data), repetitions=10).plot();
Here is the history:
Here is the simulation:
The simulated curves are so widely distributed and so far apart from each other that this cannot make sense. The historical process doesn't have that much variance. What am I misunderstanding? How do I perform the simulation correctly?

When you don't pass an initial state, it uses the first predicted state to start the simulation along with its predicted covariance. Since there is no information available to make the first prediction, it uses a diffuse prior with a variance of 1,000,000. This is why you are getting the wide range in your time series. A simple solution is to pass your own initial state using the smoothed_state.
Taking your code above, but using
initial = res.smoothed_state[:, 0]
res.simulate(len(data),
repetitions=10,
initial_state=initial).plot()
I get a plot that looks like
The first value is what really matters in this model, and is 30.6. You could add some randomness here directly by drawing the initial state from another (sensible) distribution. The default distribution is not sensible for simulation since it has a diffuse prior (it is, however, very sensible for estimation).
Other Notes
One other small note: You should not use trend="c" with d=1. You should instead use trend="t" when d=1 so that the model includes a drift. The model you estimate should be
mod = sm.tsa.statespace.SARIMAX(data["wpi"], trend="t", order=(1, 1, 1))
I used this model in the picture above to capture the positive trend in the data.

Running same python code multiple times and getting inconsistent results

I am new to Python, so I am not sure if this problem is due to my inexperience or whether this is a glitch.
I am running this code multiple times on the same data (no random number generation) and getting different results. This has occurred with more than one variable so far, and obviously I cannot proceed with the analysis until I figure out which results are trustworthy. Here is a short sample of the results I have obtained after running the code four times. Why is there such a discrepancy between these outputs? I am puzzled and greatly appreciate your advice.
Linear Regression
from scipy.stats import linregress
import scipy.stats
from scipy.signal import welch
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.signal as signal
part_022_o = pd.read_excel(r'C:\Users\Me\Desktop\Behavioral Data Processed\part_022_combined_other.xlsx')
distance_o = part_022_o["distance"]
fs = 200
f, Pwelch_spec = signal.welch(distance_o, fs=fs, window='hanning',nperseg=400, noverlap=200, scaling='density', average='mean')
log_f = np.log(f, where=f>0)
log_pwelch = np.log(Pwelch_spec, where=Pwelch_spec>0)
idx = np.isfinite(log_f) & np.isfinite(log_pwelch)
polynomial_coefficients = np.polyfit(log_f[idx],log_pwelch[idx],1)
print(polynomial_coefficients)
scipy.stats.linregress(log_f[idx], log_pwelch[idx])
Results First Attempt
[ 0.00324568 -2.82962602]
Results Second Attempt
[-2.70137164 6.97117509]
Results Third Attempt
[-2.70137164 6.97117509]
Results Fourth Attempt
[-2.28028005 5.53839502]
The same thing happens when I use scipy.stats.linregress().
Thank you,
Confused
Edit: full code added.
Also, the issue appears to be related to np.log(), since only the values of "log_f" array seem to be changing with the different outputs. It is hard to be certain that nothing else is changing (e.g. log_pwelch), but differences in output clearly correspond to differences in the first value of the "log_f" array.
Edit: I have narrowed the issue down to np.log(f, where=f>0). The first value in the f array is zero. According to the documentation of numpy log, "...Note that if an uninitialized out array is created via the default out=None, locations within it where the condition is False will remain uninitialized." Apparently this means that the value or variable is unpredictable and can vary from trial to trial, which is exactly what I am observing. Given my inexperience with Python, I am not sure what the best solution is (e.g. specifying the out-array in the log function, use a random seed, just note the regression coefficients whenever the value of zero is unchanged after log, etc.)

Try using a random seed to reproduce results. Do this with the following code at the top of your program:
import numpy as np
np.random.seed(123)  # or any number you want
See here for more info: https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.seed.html
A random seed ensures you get repeatable results when some part of your program is generating numbers at random.
Try finding out what the functions (np.polyfit(), np.log()) are actually doing by reading their documentation.
Using a seed value is standard practice in scikit-learn and machine learning work.
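Note that the question's last edit traces the variation to np.log(f, where=f>0) leaving the masked entry uninitialized, which is not random number generation, so a seed would not be expected to change it. A minimal sketch of the "specify the out array" option mentioned in that edit follows; the small f array is made up for illustration and stands in for the Welch frequencies.

```python
import numpy as np

# Stand-in for the frequency array; the first value is zero as in the question
f = np.array([0.0, 0.5, 1.0, 2.0])

# Pre-fill the output so masked-out positions hold a known value (NaN)
log_f = np.full_like(f, np.nan)
np.log(f, out=log_f, where=f > 0)

# NaN entries are now dropped deterministically by the finite mask
idx = np.isfinite(log_f)
print(log_f)
print(idx)
```

With the output pre-filled, the entry where f is zero is always NaN, so the isfinite mask excludes it on every run instead of depending on whatever happened to be in memory.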

Generating gamma random samples in python

Can someone help me generate random numbers from the gamma distribution in Python? I have tried these two possibilities, but I'm still wondering about the main difference between them.
The first one is:
shape, scale= 0.5,1
size=(1024,10)
np.random.gamma(shape, scale, size)
and the second one is:
from scipy.stats import gamma
gamma.rvs(0.5, 1, (1024,10))
I think both of them are used to generate random samples following the gamma distribution, so what's the difference between these syntaxes? When should we use the first method and when the second one?

There is no difference in the underlying distribution: the probability density function used to generate gamma samples is the same in both cases. The difference is that one comes from the NumPy library and the other from SciPy, and the two functions have different calling conventions.
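One detail worth checking when switching between the two is argument order. np.random.gamma takes (shape, scale, size), whereas scipy.stats.gamma.rvs takes (a, loc, scale, size) positionally, so in the question's gamma.rvs(0.5, 1, (1024,10)) the 1 is read as loc and the tuple as scale. The sketch below assumes the intent was shape 0.5, scale 1 and a (1024, 10) sample, and passes scale and size by keyword; the seed is arbitrary.

```python
import numpy as np
from scipy.stats import gamma

shape, scale = 0.5, 1.0
size = (1024, 10)

# NumPy: positional order is (shape, scale, size)
numpy_samples = np.random.default_rng(0).gamma(shape, scale, size)

# SciPy: positional order is (a, loc, scale, size), so use keywords
scipy_samples = gamma.rvs(shape, scale=scale, size=size,
                          random_state=np.random.default_rng(0))

# Both arrays have the requested shape and a mean near shape * scale
print(numpy_samples.shape, numpy_samples.mean())
print(scipy_samples.shape, scipy_samples.mean())
```

As a rule of thumb, the NumPy call is enough when you only need samples, while the SciPy distribution object is convenient when you also want pdf, cdf, or fit on the same distribution.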

Goodness of fit test for Weibull distribution in python

I have some data that I have to test to see if it comes from a Weibull distribution with unknown parameters. In R I could use https://cran.r-project.org/web/packages/KScorrect/index.html but I can't find anything in Python.
Using scipy.stats I can fit parameters with:
scipy.stats.weibull_min.fit(values)
However in order to turn this into a test I think I need to perform some Monte-Carlo simulation (e.g. https://en.m.wikipedia.org/wiki/Lilliefors_test) I am not sure what to do exactly.
How can I make such a test in Python?

The Lilliefors test is implemented in OpenTURNS. To use it, all you have to do is use the Factory that corresponds to the distribution you want to fit.
In the following script, I simulate a Weibull sample of size 10 and perform the Kolmogorov-Smirnov test using a sampling size equal to 1000. This means that the KS statistic is simulated 1000 times.
import openturns as ot
sample=ot.WeibullMin().getSample(10)
ot.ResourceMap.SetAsUnsignedInteger("FittingTest-KolmogorovSamplingSize",1000)
distributionFactory = ot.WeibullMinFactory()
dist, result = ot.FittingTest.Kolmogorov(sample, distributionFactory, 0.01)
print('Conclusion=', result.getBinaryQualityMeasure())
print('P-value=', result.getPValue())
More details can be found at:
http://openturns.github.io/openturns/latest/examples/data_analysis/kolmogorov_test.html
http://openturns.github.io/openturns/latest/examples/data_analysis/kolmogorov_distribution.html

One way around: estimate distribution parameters, draw data from the estimated distribution and run KS test to check that both samples come from the same distribution.
Let's create some "original" data:
>>> values = scipy.stats.weibull_min.rvs( 0.33, size=1000)
Now,
>>> args = scipy.stats.weibull_min.fit(values)
>>> print(args)
(0.32176317627928856, 1.249788665927261e-09, 0.9268793667654682)
>>> scipy.stats.kstest(values, 'weibull_min', args=args, N=100000)
KstestResult(statistic=0.033808945722737016, pvalue=0.19877935361964738)
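The last line is similar in spirit to a two-sample test against a large sample drawn from the fitted distribution (when values is an array, kstest compares it with the fitted CDF directly, and N only applies when the first argument is a distribution name or callable):
scipy.stats.ks_2samp(values, scipy.stats.weibull_min.rvs(*args, size=100000))
So, once you have estimated the parameters of the distribution, you can test the fit. Keep in mind that the p-value is optimistic when the parameters were estimated from the same data, which is the issue the Lilliefors approach addresses. Also, the scipy estimator is not very good: it took me several runs to get even "close" to the original distribution.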
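The Monte Carlo idea mentioned in the question can be sketched with scipy alone as a parametric bootstrap: fit, compute the KS statistic, then repeat the fit-and-test on samples simulated from the fitted distribution to see how extreme the observed statistic is. This is only an illustrative sketch, not the KScorrect or OpenTURNS implementation; the function name weibull_bootstrap_ks, the number of replicates, and fixing the location at zero (floc=0) are all assumptions made here.

```python
import numpy as np
from scipy import stats


def weibull_bootstrap_ks(values, n_boot=200, seed=0):
    """Parametric-bootstrap KS test for a Weibull fit with estimated parameters."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)

    # Fit with the location fixed at 0 (an assumption that stabilises the fit)
    c, loc, scale = stats.weibull_min.fit(values, floc=0)
    observed = stats.kstest(values, 'weibull_min', args=(c, loc, scale)).statistic

    boot_stats = np.empty(n_boot)
    for i in range(n_boot):
        # Simulate from the fitted model, then refit so that each replicate
        # pays the same "parameters were estimated" penalty as the real data
        sim = stats.weibull_min.rvs(c, loc=loc, scale=scale,
                                    size=values.size, random_state=rng)
        c_b, loc_b, scale_b = stats.weibull_min.fit(sim, floc=0)
        boot_stats[i] = stats.kstest(sim, 'weibull_min',
                                     args=(c_b, loc_b, scale_b)).statistic

    # Share of simulated statistics at least as large as the observed one
    p_value = (1 + np.sum(boot_stats >= observed)) / (n_boot + 1)
    return observed, p_value


data = stats.weibull_min.rvs(1.5, scale=2.0, size=100, random_state=1)
statistic, p_value = weibull_bootstrap_ks(data, n_boot=50)
print(statistic, p_value)
```

A small p_value suggests the data are unlikely to come from any Weibull distribution of this form; more replicates give a finer-grained p-value at the cost of more fits.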
