I have two questions:
1- This code takes too long to execute. Any idea how I can make it faster?
With the code below I want to generate 100 random discrete values between 700 and 1200.
I chose the Weibull distribution because I wanted to generate failure rate data; please see the histogram below.
import random

nums = []
alpha = 0.6
beta = 0.4
while len(nums) != 100:
    temp = int(random.weibullvariate(alpha, beta))
    if 700 <= temp < 1200:
        nums.append(temp)
print(nums)

# plotting a graph
#plt.hist(nums, bins = 200)
#plt.show()
I wanted to generate a histogram like this one:
[histogram image]
2- I have this function for the discrete Weibull distribution:
def DiscreteWeibull(q, b, x):
    return q**(x**b) - q**((x + 1)**b)
How can I generate random values that follow this distribution?
Since a Weibull random variable with shape parameter K and scale parameter lambda can be generated from a Uniform(0, 1) variable U via W = lambda * (-ln U)^(1/K), we can 'cut' the distribution to a desired minimum and maximum value. We do this by inverting that equation, setting W to 700 or 1200, and finding the values of U between 0 and 1 that correspond. Here's some sample code.
import math
import random
import matplotlib.pyplot as plt

def weibull_from_uniform(shape, scale, x):
    assert 0 <= x <= 1
    return scale * pow(-1 * math.log(x), 1.0 / shape)

scale_param = 0.6
shape_param = 0.4
min_value = 700.0
max_value = 1200.0

lower_bound = math.exp(-1 * pow(min_value / scale_param, shape_param))
upper_bound = math.exp(-1 * pow(max_value / scale_param, shape_param))
if lower_bound > upper_bound:
    lower_bound, upper_bound = upper_bound, lower_bound

nums = []
while len(nums) < 100:
    nums.append(weibull_from_uniform(shape_param, scale_param, random.uniform(lower_bound, upper_bound)))
print(nums)

plt.hist(nums, bins=8)
plt.show()
This code gives a histogram very similar to the one you provided; the method will give values from the same distribution as your original method, just faster. Note that this direct approach only works when our shape parameter K <= 1, so that the density function is strictly decreasing. When K > 1, the Weibull density function increases to a mode, then decreases, so you may need to draw from two uniform intervals for particular min and max values (since inverting for W and U may give two answers).
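For part 2 of the question, here is a minimal inverse-CDF sketch for that discrete Weibull pmf, assuming 0 < q < 1 and b > 0 and using the survival function P(X >= x) = q^(x^b) (the function name is mine, not from any library):

import math
import random

def discrete_weibull_sample(q, b):
    """One draw from the pmf q**(x**b) - q**((x + 1)**b), x = 0, 1, 2, ..."""
    u = random.random()
    # CDF is F(x) = 1 - q**((x + 1)**b); find the smallest integer x with F(x) >= u
    x = math.ceil((math.log(1 - u) / math.log(q)) ** (1.0 / b)) - 1
    return max(x, 0)

samples = [discrete_weibull_sample(q=0.9, b=0.8) for _ in range(100)]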
Your question is not very clear on why you thought using this Weibull distribution was a good idea, nor what distribution you are looking to achieve.
Discrete uniform distribution
Here are two ways to achieve the discrete uniform distribution on [700, 1200).
1) With random
import random
nums = [random.randrange(700, 1200) for _ in range(100)]
2) With numpy
import numpy
nums = numpy.random.randint(700, 1200, 100)
Geometric distribution
You have edited your question with an example histogram, and the mention "I wanted to generate a histogram like this one". The histogram vaguely looks like a geometric distribution.
We can use numpy.random.geometric:
import numpy

n_samples = 100
p = 0.5
a, b = 50, 650   # scale and shift: a geometric draw k becomes b + a*k, i.e. 700, 750, ...
cap = 1200       # upper bound; larger values are discarded

nums = numpy.random.geometric(p, size=2 * n_samples) * a + b
nums = nums[numpy.where(nums < cap)][:n_samples]
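A quick visual check of the sampled values against the target histogram (assuming matplotlib is available; the bin count is arbitrary):

import matplotlib.pyplot as plt

plt.hist(nums, bins=20)
plt.show()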
Related
I originally posted this on Physics Stack Exchange, but they requested it be posted here as well.
I am trying to create a known signal with a known wavelength, amplitude, and phase. I then want to break this signal apart into its individual frequencies, find the amplitude, phase, and wavelength for each frequency, and then create an equation for each frequency based on these new wavelengths, amplitudes, and phases. In theory, the equations should be identical to the individual signals. However, they are not. I am almost positive it is an issue with phase, but I cannot figure out how to resolve it. I will post the exact code to reproduce this below. Please help, as my phase, wavelength, and amplitude will vary once I get more complicated signals, so it needs to work for any combination of these.
import numpy as np
from matplotlib import pyplot as plt
from scipy import fftpack
# create signal
time_vec = np.arange(1, 11, 1)
wavelength = 1/.1
phase = 0
amp = 10
created_signal = amp * np.sin((2 * np.pi / wavelength * time_vec) + phase)
# plot it
fig, axs = plt.subplots(2, 1, figsize=(10,6))
axs[0].plot(time_vec, created_signal, label='exact_data')
# get fft and freq array
sig_fft = fftpack.fft(created_signal)
sample_freq = fftpack.fftfreq(created_signal.size, d=1)
# do inverse fft and verify same curve as original signal. This is fine!
filtered_signal = fftpack.ifft(sig_fft)
filtered_signal += np.mean(created_signal)
# create individual signals for each frequency
filtered_signals = []
for i in range(len(sample_freq)):
    high_freq_fft = sig_fft.copy()
    high_freq_fft[np.abs(sample_freq) < np.nanmin(sample_freq[i])] = 0
    high_freq_fft[np.abs(sample_freq) > np.nanmax(sample_freq[i])] = 0
    filtered_sig = fftpack.ifft(high_freq_fft)
    filtered_sig += np.mean(created_signal)
    filtered_signals.append(filtered_sig)
# get phase, amplitude, and wavelength for each individual frequency
sig_size = len(created_signal)
wavelength = []
ph = []
amp = []
indices = []
for j in range(len(sample_freq)):
    wavelength.append(1 / sample_freq[j])
    indices.append(int(sig_size * sample_freq[j]))

for j in indices:
    phase = np.arctan2(sig_fft[j].imag, sig_fft[j].real)
    ph.append([phase])
    amp.append([np.sqrt((sig_fft[j].real * sig_fft[j].real) + (sig_fft[j].imag * sig_fft[j].imag)) / (sig_size / 2)])
# create an equation for each frequency based on each phase, amp, and wavelength found from above.
def eqn(filtered_si, wavelength, time_vec, phase, amp):
    return amp * np.sin((2 * np.pi / wavelength * time_vec) + phase)

def find_equations(filtered_signals_mean, high_freq_fft, wavelength, filtered_signals, time_vec, ph, amp):
    equations = []
    for i in range(len(wavelength)):
        temp = eqn(filtered_signals[i], wavelength[i], time_vec, ph[i], amp[i])
        equations.append(temp + filtered_signals_mean)
    return equations
filtered_signals_mean = np.abs(np.mean(filtered_signals))
equations = find_equations(filtered_signals_mean, sig_fft, wavelength,
                           filtered_signals, time_vec, ph, amp)
# at this point each equation, for each frequency should match identically each signal from each frequency,
# however, the phase seems wrong and they do not match!!??
axs[0].plot(time_vec, filtered_signal, '--', linewidth=3, label='filtered_sig_combined')
axs[1].plot(time_vec, filtered_signals[1], label='filtered_sig[-1]')
axs[1].plot(time_vec, equations[1], label='equations[-1]')
axs[0].legend()
axs[1].legend()
fig.tight_layout()
plt.show()
These are issues with your code:
filtered_signal = fftpack.ifft(sig_fft)
filtered_signal += np.mean(created_signal)
This only works because np.mean(created_signal) is approximately zero. The IFFT already takes the DC component into account; the zero frequency describes the mean of the signal.
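A minimal illustration of that point (a toy signal, not the OP's exact code): the fft/ifft round trip already reproduces the signal, mean included, so no extra offset is needed:

import numpy as np
from scipy import fftpack

x = 3.0 + np.sin(2 * np.pi * 0.1 * np.arange(10))  # toy signal with a nonzero mean
x_back = fftpack.ifft(fftpack.fft(x)).real

print(np.allclose(x, x_back))  # True: the zero-frequency (DC) bin carries the mean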
filtered_signals = []
for i in range(len(sample_freq)):
    high_freq_fft = sig_fft.copy()
    high_freq_fft[np.abs(sample_freq) < np.nanmin(sample_freq[i])] = 0
    high_freq_fft[np.abs(sample_freq) > np.nanmax(sample_freq[i])] = 0
    filtered_sig = fftpack.ifft(high_freq_fft)
    filtered_sig += np.mean(created_signal)
    filtered_signals.append(filtered_sig)
Here, in the first half of the iterations, you go through all the frequencies, taking both the negative and positive frequencies into account. For example, when i=1, you keep both the -0.1 and the 0.1 frequencies. In the second half of the iterations, sample_freq[i] is negative, so the condition np.abs(sample_freq) > np.nanmax(sample_freq[i]) holds for every bin (no absolute value is below zero by definition), everything is zeroed out, and you are applying the IFFT to an all-zero spectrum.
So the filtered_signals[1] contains a sine wave constructed by both the -0.1 and the 0.1 frequency components. This is good. Otherwise it would be a complex-valued function.
for j in range(len(sample_freq)):
    wavelength.append(1 / sample_freq[j])
    indices.append(int(sig_size * sample_freq[j]))
Here the second half of the indices array contains negative values. Not sure what you were planning with this, but it causes subsequent code to index from the end of the array.
for j in indices:
    phase = np.arctan2(sig_fft[j].imag, sig_fft[j].real)
    ph.append([phase])
    amp.append([np.sqrt((sig_fft[j].real * sig_fft[j].real) + (sig_fft[j].imag * sig_fft[j].imag)) / (sig_size / 2)])
Here, because the values in indices are not the same as the j in the previous loop, ph[j] doesn't always correspond to wavelength[j]; they refer to values from different frequency components in about half the cases. But those are cases we shouldn't be evaluating anyway. The code assumes a real-valued input, for which the magnitude and phase of only the positive frequencies are sufficient to reconstruct the signal. You should skip all the negative frequencies here.
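One way to avoid dealing with the negative frequencies at all (a sketch, not part of the original answer) is numpy's real-input FFT, which only returns the non-negative frequency bins:

import numpy as np

x = created_signal                    # the real-valued signal from the question
X = np.fft.rfft(x)                    # only bins for frequencies >= 0
freqs = np.fft.rfftfreq(x.size, d=1)

amps = 2 * np.abs(X) / x.size         # amplitude per frequency (cosine convention)
amps[0] /= 2                          # the DC bin is not doubled
# (for even-length signals the Nyquist bin should not be doubled either)
phases = np.angle(X)                  # phase per frequency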
Next, you build sine waves using the collected information, but using a time_vec that starts at 1, not at 0 as the FFT assumes. And therefore the signal is shifted with respect to the expected value. Furthermore, when phase==0, you should create an even signal (i.e. a cosine, not a sine).
Thus, changing the following two lines of code will create the correct output:
time_vec = np.arange(0, 10, 1)
and
def eqn(filtered_si, wavelength, time_vec, phase, amp):
    return amp * np.cos((2 * np.pi / wavelength * time_vec) + phase)
    #               ^^^
Note that these two changes correct the plotted graph, but they don't correct all the issues in the code discussed above. (They also account for the empirical correction in the self-answer below: ((wavelength/2 - 2) * np.pi) / wavelength simplifies to pi/2 - 2*pi/wavelength, i.e. the sin-to-cos shift of pi/2 combined with the one-sample offset of time_vec.)
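To see that these two conventions (time starting at 0 and a cosine rather than a sine) are what make the recovered phase consistent, here is a small self-contained check, independent of the OP's exact numbers:

import numpy as np
from scipy import fftpack

N = 100
n = np.arange(N)                      # time starts at 0, matching the FFT convention
A, k, phi = 3.0, 5, 0.7               # amplitude, frequency bin, phase
x = A * np.sin(2 * np.pi * k * n / N + phi)

X = fftpack.fft(x)
amp = 2 * np.abs(X[k]) / N            # recovered amplitude
phase = np.angle(X[k])                # recovered phase, valid for a cosine

x_rebuilt = amp * np.cos(2 * np.pi * k * n / N + phase)
print(np.allclose(x, x_rebuilt))      # True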
I finally solved this after 2 days of frustration. I still have no idea why this is the way it is, so any insight would be great. The solution is to take the phase produced by arctan2(Im, Re) and modify it according to this equation:
phase = np.arctan2(sig_fft[j].imag, sig_fft[j].real)
formula = ((((wavelength[j]) / 2) - 2) * np.pi) / wavelength[j]
ph.append([phase + formula])
I had to derive this equation from data, but I still do not know why this works. Please let me know. Finally!!
I'm trying to simulate a simple stochastic process in Python, but with no success. The process is the following:
x(t + δt) = r(t) * x(t)
where r(t) is a Bernoulli random variable that can assume the values 1.5 or 0.6.
I've tried the following:
n = 10
r = np.zeros((1, n))
for i in range(0, n, 1):
    if r[1,i] == r[1,0]:
        r[1,i] = 1
    else:
        B = bernoulli.rvs(0.5, size=1)
        if B == 0:
            r[1,i] = r[1,i-1] * 0.6
        else:
            r[1,i] = r[1,i-1] * 1.5
Can you explain why this is wrong and a possible solution?
First of all, the process should be perceived over time, so you also need to consider the discretization (a step size) rather than just giving the number of steps through n.
Essentially, what you are asking for is just a simple random walk with a Bernoulli random variable taking on the values 0.6 and 1.5 instead of a Gaussian (standard normal) random variable.
So I have created an answer here, using NumPy to create the Bernoulli random variable for efficiency (numpy is faster than scipy), then running the simulation with a step size of 0.01 and plotting the solution using matplotlib.
One thing to note is that this process is one-dimensional, so we can just store the state and time in separate vectors and plot them at the end.
import numpy as np

# Function generating a Bernoulli trial (your r(t))
def get_bernoulli(p=0.5):
    '''
    Function using numpy (faster than scipy.stats)
    to generate a Bernoulli random variable with value 0.6 or 1.5
    '''
    B = np.random.binomial(1, p, 1)
    if B == 0:
        return 0.6
    else:
        return 1.5
This is then used in the simulation as
import numpy as np
import matplotlib.pyplot as plt
dt = 0.01               # step size
x0 = 1                  # initial value
tfinal = 1
n = int(tfinal / dt)    # number of steps

# State and time vectors
xtraj = np.zeros(n + 1, float)
trange = np.linspace(start=0, stop=tfinal, num=n + 1)

# Initial condition
xtraj[0] = x0

for i in range(n):
    xtraj[i + 1] = xtraj[i] * get_bernoulli(p=0.5)
plt.plot(trange,xtraj,label=r'$x(t)$')
plt.xlabel("time")
plt.ylabel(r"$X$")
plt.legend()
plt.show()
Where we assumed the Bernoulli trial is fair, but can be customized to add some more variation.
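If performance matters, the same process can also be simulated without a Python loop; a minimal vectorized sketch using numpy.random.choice and a cumulative product (variable names are my own, not from the answer above):

import numpy as np
import matplotlib.pyplot as plt

n = 100          # number of steps
x0 = 1.0         # initial value

# Draw all Bernoulli factors (0.6 or 1.5, each with probability 0.5) at once,
# then take the running product to get the whole trajectory x(t).
r = np.random.choice([0.6, 1.5], size=n, p=[0.5, 0.5])
xtraj = np.concatenate(([x0], x0 * np.cumprod(r)))

plt.plot(np.arange(n + 1), xtraj)
plt.xlabel("step")
plt.ylabel("x")
plt.show()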
I would like to sample from an arbitrary function in Python.
In Fast arbitrary distribution random sampling it was stated that one could use inverse transform sampling, and in Pythonic way to select list elements with different probability it was mentioned that one should use the inverse cumulative distribution function. As far as I understand, those methods only work in the univariate case. My function is multivariate, though, and too complex for any of the suggestions in https://stackoverflow.com/a/48676209/4533188 to apply.
Preliminaries: My function is based on Rosenbrock's banana function, whose value we can get with
import scipy.optimize
scipy.optimize.rosen([1.1,1.2])
(here [1.1,1.2] is the input vector) from scipy, see https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.optimize.rosen.html.
Here is what I came up with: I make a grid over my area of interest and calculate for each point the function value. Then I sort the resulting data frame by the value and make a cumulative sum. This way we get "slots" which have different sizes - points which have large function values have larger slots than points with small function values. Now we generate random values and look into which slot the random value falls into. The row of the data frame is our final sample.
Here is the code:
import numpy as np
import pandas as pd
import scipy.optimize
from itertools import product
from dfply import *

nb_of_samples = 50
nb_of_grid_points = 30

rosen_data = pd.DataFrame(np.array([item for item in product(*[np.linspace(fm[0], fm[1], nb_of_grid_points) for fm in zip([-2,-2], [2,2])])]), columns=['x','y'])
rosen_data['z'] = [np.exp(-scipy.optimize.rosen(row)**2/500) for index, row in rosen_data.iterrows()]
rosen_data = rosen_data >> \
    arrange(X.z) >> \
    mutate(z_upperbound=cumsum(X.z)) >> \
    mutate(z_upperbound=X.z_upperbound/np.max(X.z_upperbound))

def get_rosen_sample(value):
    return (rosen_data >> mask(X.z_upperbound >= value) >> select(X.x, X.y)).iloc[0,]

values = pd.DataFrame([get_rosen_sample(s) for s in np.random.sample(nb_of_samples)])
This works well, but I don't think it is very efficient. What would be a more efficient solution to my problem?
I read that Markov chain Monte Carlo might help, but I am in over my head for now on how to do this in Python.
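For reference, the grid/"slot" approach above can be vectorized with numpy.searchsorted, which avoids filtering the data frame once per sample (a sketch with the same grid bounds and density as the code above; dfply is not needed):

import numpy as np
import scipy.optimize

nb_of_samples = 50
nb_of_grid_points = 30

# Same grid over [-2, 2] x [-2, 2] and the same unnormalized density as above
axis = np.linspace(-2, 2, nb_of_grid_points)
xx, yy = np.meshgrid(axis, axis)
grid = np.column_stack([xx.ravel(), yy.ravel()])
z = np.exp(-np.array([scipy.optimize.rosen(p) for p in grid]) ** 2 / 500)

# Cumulative weights define the "slots"; searchsorted picks a slot per uniform draw
cdf = np.cumsum(z)
cdf /= cdf[-1]
idx = np.searchsorted(cdf, np.random.random(nb_of_samples))
samples = grid[idx]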
I was in a similar situation, so I implemented a rudimentary version of Metropolis-Hastings (which is an MCMC method) to sample from a bivariate distribution. An example follows.
Say we want to sample from the following density:
import numpy as np

def density1(z):
    z = np.reshape(z, [z.shape[0], 2])
    z1, z2 = z[:, 0], z[:, 1]
    norm = np.sqrt(z1 ** 2 + z2 ** 2)
    exp1 = np.exp(-0.5 * ((z1 - 2) / 0.8) ** 2)
    exp2 = np.exp(-0.5 * ((z1 + 2) / 0.8) ** 2)
    u = 0.5 * ((norm - 4) / 0.4) ** 2 - np.log(exp1 + exp2)
    return np.exp(-u)
which looks like this: [density plot image]
The following function implements MH with multivariate normal as the proposal
def metropolis_hastings(target_density, size=500000):
    burnin_size = 10000
    size += burnin_size
    x0 = np.array([[0, 0]])
    xt = x0
    samples = []
    for i in range(size):
        xt_candidate = np.array([np.random.multivariate_normal(xt[0], np.eye(2))])
        accept_prob = (target_density(xt_candidate)) / (target_density(xt))
        if np.random.uniform(0, 1) < accept_prob:
            xt = xt_candidate
        samples.append(xt)
    samples = np.array(samples[burnin_size:])
    samples = np.reshape(samples, [samples.shape[0], 2])
    return samples
Run MH and plot samples
import matplotlib.pyplot as plt

samples = metropolis_hastings(density1)
plt.hexbin(samples[:,0], samples[:,1], cmap='rainbow')
plt.gca().set_aspect('equal', adjustable='box')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.show()
Check out this repo of mine for details.
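To connect this back to the question, the same sampler can be pointed at the Rosenbrock-based density from the question (a sketch; the reshaping mirrors density1 above and the function name is mine):

import numpy as np
import scipy.optimize

def rosen_density(z):
    # Unnormalized density from the question: exp(-rosen(x, y)**2 / 500)
    z = np.reshape(z, [z.shape[0], 2])
    return np.array([np.exp(-scipy.optimize.rosen(p) ** 2 / 500) for p in z])

rosen_samples = metropolis_hastings(rosen_density, size=50000)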
I'm calculating the Gini coefficient (similar to: Python - Gini coefficient calculation using Numpy), but I get an odd result. For a uniform distribution sampled from np.random.rand(), the Gini coefficient is 0.3, but I would have expected it to be close to 0 (perfect equality). What is going wrong here?
import numpy as np
import matplotlib.pyplot as plt

def G(v):
    bins = np.linspace(0., 100., 11)
    total = float(np.sum(v))
    yvals = []
    for b in bins:
        bin_vals = v[v <= np.percentile(v, b)]
        bin_fraction = (np.sum(bin_vals) / total) * 100.0
        yvals.append(bin_fraction)
    # perfect equality area
    pe_area = np.trapz(bins, x=bins)
    # lorenz area
    lorenz_area = np.trapz(yvals, x=bins)
    gini_val = (pe_area - lorenz_area) / float(pe_area)
    return bins, yvals, gini_val

v = np.random.rand(500)
bins, result, gini_val = G(v)

plt.figure()
plt.subplot(2, 1, 1)
plt.plot(bins, result, label="observed")
plt.plot(bins, bins, '--', label="perfect eq.")
plt.xlabel("fraction of population")
plt.ylabel("fraction of wealth")
plt.title("GINI: %.4f" %(gini_val))
plt.legend()
plt.subplot(2, 1, 2)
plt.hist(v, bins=20)
For the given set of numbers, the above code calculates the fraction of the total distribution's values that are in each percentile bin.
The result: [plot image]
Uniform distributions should be near "perfect equality", so the Lorenz curve bending is off.
This is to be expected. A random sample from a uniform distribution does not result in uniform values (i.e. values that are all relatively close to each other). With a little calculus, it can be shown that the expected value (in the statistical sense) of the Gini coefficient of a sample from the uniform distribution on [0, 1] is 1/3, so getting values around 1/3 for a given sample is reasonable.
You'll get a lower Gini coefficient with a sample such as v = 10 + np.random.rand(500). Those values are all close to 10.5; the relative variation is lower than the sample v = np.random.rand(500).
In fact, the expected value of the Gini coefficient for the sample base + np.random.rand(n) is 1/(6*base + 3).
Here's a simple implementation of the Gini coefficient. It uses the fact that the Gini coefficient is half the relative mean absolute difference.
import numpy as np

def gini(x):
    # (Warning: This is a concise implementation, but it is O(n**2)
    # in time and memory, where n = len(x). *Don't* pass in huge
    # samples!)

    # Mean absolute difference
    mad = np.abs(np.subtract.outer(x, x)).mean()
    # Relative mean absolute difference
    rmad = mad / np.mean(x)
    # Gini coefficient
    g = 0.5 * rmad
    return g
(For some more efficient implementations, see More efficient weighted Gini coefficient in Python)
Here's the Gini coefficient for several samples of the form v = base + np.random.rand(500):
In [80]: v = np.random.rand(500)
In [81]: gini(v)
Out[81]: 0.32760618249832563
In [82]: v = 1 + np.random.rand(500)
In [83]: gini(v)
Out[83]: 0.11121487509454202
In [84]: v = 10 + np.random.rand(500)
In [85]: gini(v)
Out[85]: 0.01567937753659053
In [86]: v = 100 + np.random.rand(500)
In [87]: gini(v)
Out[87]: 0.0016594595244509495
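These values line up with the 1/(6*base + 3) expression mentioned above:

# Theoretical Gini of base + Uniform(0, 1), for comparison with the samples above
for base in (0, 1, 10, 100):
    print(base, 1 / (6 * base + 3))
# 0 0.3333..., 1 0.1111..., 10 0.01587..., 100 0.001658...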
A slightly faster implementation (using numpy vectorization and only computing each difference once):
def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    diffsum = 0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x)**2 * np.mean(x))
Note: x must be a numpy array.
The Gini coefficient is defined via the Lorenz curve (it equals twice the area between the line of equality and the Lorenz curve) and is usually calculated for analyzing the distribution of income in a population. https://github.com/oliviaguest/gini provides a simple Python implementation for it.
A quick note on the original methodology:
When calculating Gini coefficients directly from areas under curves with np.trapz or another integration method, the first value of the Lorenz curve needs to be 0 so that the area between the origin and the second value is accounted for. The following changes to G(v) fix this:
yvals = [0]
for b in bins[1:]:
I also discussed this issue in this answer, where including the origin in those calculations provides an equivalent answer to using the other methods discussed here (which do not need 0 to be appended).
In short, when calculating Gini coefficients directly using integration, start from the origin. If using the other methods discussed here, then it's not needed.
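For reference, here is the question's G(v) with that single change applied (a sketch; everything else is left as in the original):

import numpy as np

def G(v):
    bins = np.linspace(0., 100., 11)
    total = float(np.sum(v))
    yvals = [0]                               # Lorenz curve now starts at the origin
    for b in bins[1:]:
        bin_vals = v[v <= np.percentile(v, b)]
        bin_fraction = (np.sum(bin_vals) / total) * 100.0
        yvals.append(bin_fraction)
    pe_area = np.trapz(bins, x=bins)          # perfect equality area
    lorenz_area = np.trapz(yvals, x=bins)     # Lorenz area
    gini_val = (pe_area - lorenz_area) / float(pe_area)
    return bins, yvals, gini_val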
Note that the Gini index is currently available in skbio.diversity.alpha as gini_index. It might give slightly different results from the examples mentioned above.
You are getting the right answer. The Gini coefficient of the uniform distribution is not 0 ("perfect equality") but (b - a) / (3 * (b + a)). In your case, b = 1 and a = 0, so Gini = 1/3.
The only distributions with perfect equality are the Kronecker and Dirac deltas. Remember that equality means "all the same", not "all equally probable".
There were some issues with the previous implementations: they never give a Gini index of 1 for perfectly sparse data.
example:
def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    diffsum = 0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x)**2 * np.mean(x))
gini_coefficient(np.array([0, 0, 1]))
gives the answer 0.666666. That happens because of the implied "integration scheme" it uses.
Here is another variant that bypasses the issue, although it is computationally heavier:
import numpy as np
from scipy.interpolate import interp1d

def gini(v, n_new=1000):
    """Compute Gini coefficient of array of values"""
    v_abs = np.sort(np.abs(v))
    cumsum_v = np.cumsum(v_abs)
    n = len(v_abs)
    vals = np.concatenate([[0], cumsum_v / cumsum_v[-1]])
    x = np.linspace(0, 1, n + 1)
    f = interp1d(x=x, y=vals, kind='previous')
    xnew = np.linspace(0, 1, n_new + 1)
    dx_new = 1 / n_new
    vals_new = f(xnew)
    return 1 - 2 * np.trapz(y=vals_new, x=xnew, dx=dx_new)

gini(np.array([0, 0, 1]))
It gives an output of 0.999, which is closer to what one wants to have =)
I need to know how to generate 1000 random numbers between 500 and 600 that have a mean = 550 and a standard deviation = 30 in Python.
import pylab
import random

xrandn = pylab.zeros(1000, float)
for j in range(500, 601):
    xrandn[j] = pylab.randn()
???????
You are looking for stats.truncnorm:
import scipy.stats as stats
a, b = 500, 600
mu, sigma = 550, 30
dist = stats.truncnorm((a - mu) / sigma, (b - mu) / sigma, loc=mu, scale=sigma)
values = dist.rvs(1000)
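A quick sanity check on the sample above (exact numbers vary per run; note that because the normal is truncated at roughly ±1.67σ, the sample standard deviation comes out noticeably below 30, around 24):

# Sanity check of the truncated-normal sample generated above
print(values.mean())               # close to 550
print(values.std())                # around 24, below the requested 30, due to truncation
print(values.min(), values.max())  # inside [500, 600]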
There are other choices for your problem too. Wikipedia has a list of continuous distributions with bounded intervals, depending on the distribution you may be able to get your required characteristics with the right parameters. For example, if you want something like "a bounded Gaussian bell" (not truncated) you can pick the (scaled) beta distribution:
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
def my_distribution(min_val, max_val, mean, std):
    scale = max_val - min_val
    location = min_val
    # Mean and standard deviation of the unscaled beta distribution
    unscaled_mean = (mean - min_val) / scale
    unscaled_var = (std / scale) ** 2
    # Computation of alpha and beta can be derived from mean and variance formulas
    t = unscaled_mean / (1 - unscaled_mean)
    beta = ((t / unscaled_var) - (t * t) - (2 * t) - 1) / ((t * t * t) + (3 * t * t) + (3 * t) + 1)
    alpha = beta * t
    # Not all parameters may produce a valid distribution
    if alpha <= 0 or beta <= 0:
        raise ValueError('Cannot create distribution for the given parameters.')
    # Make scaled beta distribution with computed parameters
    return scipy.stats.beta(alpha, beta, scale=scale, loc=location)
np.random.seed(100)
min_val = 1.5
max_val = 35
mean = 9.87
std = 3.1
my_dist = my_distribution(min_val, max_val, mean, std)
# Plot distribution PDF
x = np.linspace(min_val, max_val, 100)
plt.plot(x, my_dist.pdf(x))
# Stats
print('mean:', my_dist.mean(), 'std:', my_dist.std())
# Get a large sample to check bounds
sample = my_dist.rvs(size=100000)
print('min:', sample.min(), 'max:', sample.max())
Output:
mean: 9.87 std: 3.100000000000001
min: 1.9290674232087306 max: 25.03903889816994
Probability density function plot:
Note that not every possible combination of bounds, mean and standard deviation will produce a valid distribution in this case, though, and depending on the resulting values of alpha and beta the probability density function may look like an "inverted bell" instead (even though mean and standard deviation would still be correct).
I'm not exactly sure what the OP desired, but if they just wanted an array xrandn matching the bottom plot, below I present the steps:
First, create a standard distribution (Gaussian distribution), the easiest way might be to use numpy:
import numpy as np
random_nums = np.random.normal(loc=550, scale=30, size=1000)
And then you keep only the numbers within the desired range with a list comprehension:
random_nums_filtered = [i for i in random_nums if i>500 and i<600]
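Note that the filtered list will contain fewer than 1000 values. If exactly 1000 in-range values are needed, a simple rejection loop works (a sketch; as with the truncated normal above, the resulting standard deviation will come out below 30 because the tails are cut off):

import numpy as np

# Keep drawing normal samples and retain only those inside (500, 600)
# until we have 1000 of them.
nums = np.empty(0)
while nums.size < 1000:
    draw = np.random.normal(loc=550, scale=30, size=1000)
    nums = np.concatenate([nums, draw[(draw > 500) & (draw < 600)]])
nums = nums[:1000]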