Numpy: how to generate a random noisy curve resembling a "training curve" - python

I'd like to know how I can generate some random data whose plot resembles a "training curve." By training curve, I mean an array of training loss values from a learning model. These typically have larger values and variance at the beginning, and over time converge to some value with very little variance. It looks a bit like a noisy exponential curve.
This is the closest I've gotten to making random data that resembles a training curve. The problems are that the curve does not flatten out or converge like true loss curves, and there is too much variance on the flatter part.
import numpy as np
import matplotlib.pyplot as plt
num_iters = 2000
rand_curve = np.sort(np.random.exponential(size=num_iters))[::-1]
noise = np.random.normal(0, 0.2, num_iters)
signal = rand_curve + noise
noisy_curve = signal[signal > 0]
plt.plot(noisy_curve, c='r', label='random curve')
And here is an actual training loss curve for reference.
I do not know enough about probability distributions to know if this is a stupid question. I only wanted to generate a random curve so that others had a data array to work with to help me with another question I have about logarithmic plots in matplotlib.

Here is an illustration of how to do it with a gamma distribution for the noise:
import numpy as np
import scipy.stats

x = np.arange(2000)
y = 0.00025 + 0.001 * np.exp(-x/100.) + scipy.stats.gamma(3).rvs(len(x)) * (1 - np.exp(-x/100)) * 2e-5
You can adjust the parameters here to reduce the amount of noise, etc.
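To see the result, here is a minimal plotting snippet (reusing x and y from above; tweak the decay constant 100 or the noise scale 2e-5 to taste):
import matplotlib.pyplot as plt

plt.plot(x, y, c='r', label='synthetic training curve')
plt.yscale('log')  # loss curves like this are often easier to judge on a log scale
plt.legend()
plt.show()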

It seems like you could apply a damping factor to each noise value that is proportional to how far along the x axis that value is. In this case, that would mean the variance decreases as the curve flattens out. Something like:
import numpy as np
import matplotlib.pyplot as plt
num_iters = 2000
rand_curve = np.sort(np.random.exponential(size=num_iters))[::-1]
noise = np.random.normal(0, 0.2, num_iters)
# damp each noise value in proportion to its position along the x axis
index = 0
for noise_value in np.nditer(noise):
    noise[index] = noise_value * (1 - index / num_iters)
    index = index + 1
signal = rand_curve + noise
noisy_curve = signal[signal > 0]
plt.plot(noisy_curve, c='r', label='random curve')
This way the noise values get smaller the further along the x axis you go, which should achieve the result you want!
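A vectorized equivalent of that loop, as a sketch (reusing num_iters from above; linear damping is just one choice, any decreasing factor works):
# scale the noise by a factor that falls linearly from 1 to 0 across the curve
noise = np.random.normal(0, 0.2, num_iters) * (1 - np.arange(num_iters) / num_iters)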

Related

I want to fit my histogram with a curve but don't know what to do

I want a normal curve to fit the histogram I already have.
navf2 is a list of normally distributed random numbers, the histogram is based on those, and I want a curve that shows the general trend of the histogram.
import numpy as np
import matplotlib.pyplot as plt

navf2 = []
while len(navf2) < 252:
    number = np.random.normal(0, 1, None)
    navf2.append(number)
bin_edges = np.arange(70, 130, 1)
plt.style.use(["dark_background", 'ggplot'])
plt.hist(navf2, bins=bin_edges, alpha=1)
plt.ylabel("Frequency of final NAV")
plt.xlabel("Ranges")
ymin = 0
ymax = 100
plt.ylim([ymin, ymax])
plt.show()
Here You go:
=^..^=
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
# create raw data
data = np.random.uniform(size=252)
# distribution fitting
mu, sigma = norm.fit(data)
# build the fitted curve
x = np.linspace(-0.5,1.5,100)
y = norm.pdf(x, loc=mu, scale=sigma)
# plot data
plt.plot(x, y,'r-')
plt.hist(data, density=1, alpha=1)
plt.show()
Output:
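To apply the same idea to your own data (assuming navf2 is the list built in the question), you can fit it directly instead of the uniform sample:
navf2_arr = np.asarray(navf2)
mu, sigma = norm.fit(navf2_arr)  # fit a normal distribution to the data
x = np.linspace(navf2_arr.min(), navf2_arr.max(), 100)
plt.plot(x, norm.pdf(x, loc=mu, scale=sigma), 'r-')
plt.hist(navf2_arr, density=1, alpha=1)
plt.show()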
Here is another solution using your code from the question. We can achieve the expected result without the scipy library. We will have to do three things: compute the mean of the data set, compute the standard deviation of the set, and create a function that generates the normal (Gaussian) curve.
To compute the mean we can use the built-in numpy function, i.e. mu = np.mean(your_data_set_here).
The standard deviation of the set is the square root of the mean of the squared differences between the values and the mean (https://en.wikipedia.org/wiki/Standard_deviation). We can express it in code as follows, using the numpy library again:
data_set = np.asarray(your_data_set_here)  # your data as a numpy array
sigma = np.sqrt(1/(len(data_set))*sum((data_set-mu)**2))
Finally we have to build the function for the normal (Gaussian) curve (https://en.wikipedia.org/wiki/Gaussian_function). It relies on both the mean (mu) and the standard deviation (sigma), so we will use those as parameters in our function:
def Gaussian(x, sigma, mu):  # sigma is the standard deviation and mu is the mean
    return (1/(np.sqrt(2*np.pi)*sigma)) * np.exp(-(x-mu)**2/(2*sigma**2))
Putting it all together looks like this:
import numpy as np
import matplotlib.pyplot as plt
navf2 = []
while len(navf2) < 252:
    number = np.random.normal(0, 1, None)  # values are drawn from N(0, 1), so the 70-130 bin range from the question won't capture them
    navf2.append(number)
navf2 = np.asarray(navf2)  # convert to an array so the vectorized arithmetic below works
mu = np.mean(navf2)  # the average of all values in navf2
sigma = np.sqrt(1/(len(navf2))*sum((navf2-mu)**2))  # standard deviation of navf2
x_vals = np.arange(min(navf2), max(navf2), 0.001)  # create a fine range based on the data
# to build the curve
gauss = []  # store values for the normal curve here
def Gaussian(x, sigma, mu):  # defining the normal curve
    return (1/(np.sqrt(2*np.pi)*sigma)) * np.exp(-(x-mu)**2/(2*sigma**2))
for val in x_vals:
    gauss.append(Gaussian(val, sigma, mu))
plt.style.use(["dark_background", 'ggplot'])
plt.hist(navf2, density=1, alpha=1)  # density=1 fixes the scaling issue
plt.ylabel("Frequency of final NAV")
plt.xlabel("Ranges")
plt.plot(x_vals, gauss)
plt.show()
Here is a picture of an output:
Hope this helps, I tried to keep it as close to your original code as possible!
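As a quick sanity check, the manually computed sigma should match numpy's built-in (population) standard deviation, reusing navf2 and sigma from the code above:
print(np.isclose(sigma, np.std(navf2)))  # expect True: np.std also uses the 1/N formula by default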

Sampling distribution Normal Approximation Misfit

I was trying to simulate the "Sampling Distribution of Sample Proportions" using Python. I tried it with a Bernoulli variable as in the example here.
The crux is that, out of a large number of gumballs, we have yellow balls with a true proportion of 0.6. If we take samples (of some size, say 10), take the mean of each and plot them, we should get a normal distribution.
I have managed to obtain a normal-looking sampling distribution; however, the continuous normal curve with the same mu and sigma does not fit at all, but appears scaled up by some factor. I am not sure what is causing this; ideally it should fit perfectly. Below is my code and output. I tried varying the amplitude and also sigma (dividing by sqrt(sample size)), but nothing helped. Kindly help.
Code:
from SDSP import create_bernoulli_population, get_frequency_df
from random import shuffle, choices
from bi_to_nor_demo import get_metrics, bare_minimal_plot
import matplotlib.pyplot as plt
N = 10000 # 10000 balls
p = 0.6 # probability of yellow ball is 0.6, and others (1-0.6)=>0.4
n_pickups = 10 # sample size
n_experiments = 2000 # I don't know what this is called
# STATISTICAL PDF
# (population is presumably built here via create_bernoulli_population; that step is not shown in the question)
# choose a sample, take its mean and append it to X_mean_list. Do this n_experiments times.
X_hat = []
X_mean_list = []
for each_experiment in range(n_experiments):
    X_hat = choices(population, k=n_pickups)  # choose, say, 10 samples from the population (with replacement)
    X_mean = sum(X_hat)/len(X_hat)
    X_mean_list.append(X_mean)
stats_df = get_frequency_df(X_mean_list)
# plot both theoretical and statistical outcomes
fig, ax = plt.subplots(1,1, figsize=(5,5))
from SDSP import plot_pdf
mu,var,sigma = get_metrics(stats_df)
plot_pdf(stats_df, ax, n_pickups, mu, sigma, p=mu, bar_width=round(0.5/n_pickups,3),
title='Sampling Distribution of\n a Sample Proportion')
plt.tight_layout()
plt.show()
Output:
Red curve is the misfit normal approximation curve. The mu and sigma are derived from the statistical discrete distribution (small blue bars) and fed to the formula that calculates the normal curve. But the normal curve looks scaled up somehow.
Update:
Avoiding the division when taking the average solves the graph issue, but then mu is scaled. So the issue is still not fully solved. :(
X_mean = sum(X_hat) # removed the division /len(X_hat)
Output after removing the above division (but isn't it needed?):
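One likely source of this kind of mismatch is comparing a probability density (which integrates to 1) against per-outcome probabilities (which sum to 1): the pdf has to be multiplied by the spacing between the possible sample means (1/n here) before overlaying it. A self-contained sketch of that idea, independent of the SDSP helpers:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

p, n, n_experiments = 0.6, 10, 2000
# sample proportions of a Bernoulli(p) variable, n draws per experiment
means = np.random.binomial(n, p, size=n_experiments) / n
values, counts = np.unique(means, return_counts=True)
probs = counts / n_experiments           # empirical probabilities, sum to 1
spacing = 1.0 / n                        # possible sample means are 1/n apart
mu, sigma = p, np.sqrt(p * (1 - p) / n)  # theoretical moments of the sample mean
xs = np.linspace(0, 1, 500)
plt.bar(values, probs, width=0.5 * spacing)
plt.plot(xs, norm.pdf(xs, mu, sigma) * spacing, 'r-')  # scale the density by the spacing
plt.show()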

sklearn LogisticRegression - plot displays too small coefficient

I am attempting to fit a logistic regression model to sklearn's iris dataset. I get a probability curve that looks too flat, i.e. the coefficient seems too small. I would expect a probability of over ninety percent for sepal length > 7:
Is this probability curve indeed wrong? If so, what might cause that in my code?
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
import math
from sklearn.linear_model import LogisticRegression
data = datasets.load_iris()
#get relevant data
lengths = data.data[:100, :1]
is_setosa = data.target[:100]
#fit model
lgs = LogisticRegression()
lgs.fit(lengths, is_setosa)
m = lgs.coef_[0,0]
b = lgs.intercept_[0]
#generate values for curve overlay
lgs_curve = lambda x: 1/(1 + math.e**(-(m*x+b)))
x_values = np.linspace(2, 10, 100)
y_values = lgs_curve(x_values)
#plot it
plt.plot(x_values, y_values)
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
If you refer to http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression, you will find a regularization parameter C that can be passed as an argument when training the logistic regression model.
C : float, default: 1.0 Inverse of regularization strength; must be a
positive float. Like in support vector machines, smaller values
specify stronger regularization.
Now, if you try different values of this regularization parameter, you will find that larger values of C lead to fitted curves with sharper transitions from 0 to 1 in the output (response) binary variable. Still larger values fit models with high variance (they model the transition in the training data more closely; I think that is what you are expecting, so you may try setting C as high as 10 and plotting), but at the same time they risk overfitting, while the default value C=1 and smaller values lead to high bias and are likely to underfit. Here comes the famous bias-variance trade-off in machine learning.
You can always use techniques like cross-validation to choose the C value that is right for you. The following code / figure shows the probability curve fitted with models of different complexity (i.e., with different values of the regularization parameter C, from 1 to 10):
x_values = np.linspace(2, 10, 100)
x_test = np.reshape(x_values, (100,1))
C = list(range(1, 11))
labels = [str(c) for c in C]  # build a list; in Python 3, map() returns an iterator that cannot be indexed
for i in range(len(C)):
    lgs = LogisticRegression(C=C[i])  # pass a value for the regularization parameter C
    lgs.fit(lengths, is_setosa)
    y_values = lgs.predict_proba(x_test)[:, 1]  # use this function to compute probabilities directly
    plt.plot(x_values, y_values, label=labels[i])
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
plt.legend()
plt.show()
Predicted probs with models fitted with different values of C
Although you do not describe what you want to plot, I assume you want to plot the separating line. It seems that you are confused with respect to the Logistic/sigmoid function. The decision function of Logistic Regression is a line.
Your probability graph looks flat because you have, in a sense, "zoomed in" too much.
If you look at the middle of a sigmoid function, it gets to be almost linear, as the second derivative gets close to 0 (see for example a Wolfram Alpha graph).
Please note that the values we are talking about are the results of -(m*x+b).
When we reduce the limits of your graph, say by using
x_values = np.linspace(4, 7, 100), we get something which looks like a line:
But on the other hand, if we go crazy with the limits, say by using x_values = np.linspace(-10, 20, 100), we get the clearer sigmoid:
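A minimal sketch of the two axis ranges side by side (m and b below are placeholder values standing in for the fitted lgs.coef_[0, 0] and lgs.intercept_[0] from the question):
import numpy as np
import matplotlib.pyplot as plt

m, b = 4.0, -22.0  # placeholders for the fitted coefficient and intercept
sigmoid = lambda x: 1 / (1 + np.exp(-(m * x + b)))

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, (lo, hi) in zip(axes, [(4, 7), (-10, 20)]):
    xs = np.linspace(lo, hi, 100)
    ax.plot(xs, sigmoid(xs))
    ax.set_title(f"x in [{lo}, {hi}]")
plt.show()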

Plotting only one side of gaussian in Python using matplotlib and scipy

I have a set of points in the first quadrant that looks like a Gaussian, and I am trying to fit it using a Gaussian in Python. My code is as follows:
import pylab as plb
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from numpy import asarray as ar, exp  # scipy's asarray/exp aliases have been removed in recent versions
import math
x=ar([37,69,157,238,274,319,391,495,533,626,1366,1855,2821,3615,4130,4374,6453,6863,7021,
7951,8646,9656,10464,11400])
y=ar([1.77,1.67,1.65,1.17,1.34,1.46,0.75,1,0.8,1.02,0.65,0.69,0.44,0.44,0.55,0.43,0.75,0.27,0.26,
0.44,0.04,0.44,0.26,0.04])
n = 24  # the number of data points
mean = sum(x*y)/n #note this correction
sigma = math.sqrt(sum(y*(x-mean)**2)/n) #note this correction
def gaus(x, a, x0, sigma):
    return a*exp(-(x-x0)**2/(2*sigma**2))
popt, pcov = curve_fit(gaus, x, y, p0=None, sigma=None)  # p0=[1,mean,sigma]
plt.plot(x,y,'b+:',label='data')
plt.plot(x,gaus(x,*popt),'ro:',label='fit')
plt.legend()
plt.title('Fig. 3 - Fit for Time Constant')
plt.xlabel('Time (s)')
plt.ylabel('Voltage (V)')
plt.show()
And the output is this figure:
http://s2.postimg.org/wevggkc95/Workspace_1_022.png
Why do all the red points come out below the data? Also note that I am interested in a half Gaussian, as that is what my data looks like: my y values are large at first and then decrease like one side of the Gaussian bell. Can anyone tell me how to fit this curve in Python, in case it cannot be fit to a Gaussian? In other words, I want code to fit the half (left side) Gaussian of my points (in the first quadrant only). Note that my points cannot be fit with an exponentially decreasing curve; I tried that earlier, and it does not fit well at low x values.
Apparently your data do not fit well or easily to a Gaussian function. You use the default initial guesses for p0 = [1,1,1] which is so far away from any kind of optimal choice that curve_fit gives up before it gets started (check the values of popt=[1,1,1] and pcov=[inf, inf, inf]). You could try with better guesses (e.g. p0 = [2,0, 2000]), but on my system it won't converge: Optimal parameters not found: Number of calls to function has reached maxfev = 800.
To fit a "half-Gaussian", don't float the centre position x0 (just leave it equal to 0):
def gaus(x, a, sigma):
    return a*exp(-(x)**2/(2*sigma**2))

p0 = [1.2, 4000]
popt, pcov = curve_fit(gaus, x, y, p0=p0)
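To see the resulting fit, a minimal plotting sketch (reusing x, y, exp, curve_fit, and plt from the question's code):
plt.plot(x, y, 'b+', label='data')
plt.plot(x, gaus(x, *popt), 'r-', label='half-Gaussian fit')
plt.legend()
plt.show()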
Unless you have a particular reason for wanting to fit a Gaussian, why not do a more robust linear least squares fit to a polynomial, e.g.:
import numpy as np

pfit = np.polyfit(x, y, 3)  # cubic least-squares fit
poly = np.poly1d(pfit)
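The fitted polynomial can then be evaluated and plotted over the data (again reusing x, y, and plt from the question):
plt.plot(x, y, 'b+', label='data')
plt.plot(x, poly(x), 'g-', label='cubic fit')
plt.legend()
plt.show()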

gaussian sum filter for irregular spaced points

I have a set of points (x, y) as two vectors x and y, for example:
from pylab import *
x = sorted(random(30))
y = random(30)
plot(x,y, 'o-')
Now I would like to smooth this data with a Gaussian and evaluate it only at certain (regularly spaced) points on the x-axis, let's say for:
x_eval = linspace(0,1,11)
I got the tip that this method is called a "Gaussian sum filter", but so far I have not found any implementation in numpy/scipy for that, although it seems like a standard problem at first glance.
As the x values are not equally spaced I can't use the scipy.ndimage.gaussian_filter1d.
Usually this kind of smoothing is done by going through Fourier space and multiplying with the kernel, but I don't really know if that is possible with irregularly spaced data.
Thanks for any ideas.
This will blow up for very large datasets, but the proper calculation you are asking for would be done as follows:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0) # for repeatability
x = np.random.rand(30)
x.sort()
y = np.random.rand(30)
x_eval = np.linspace(0, 1, 11)
sigma = 0.1
delta_x = x_eval[:, None] - x
weights = np.exp(-delta_x*delta_x / (2*sigma*sigma)) / (np.sqrt(2*np.pi) * sigma)
weights /= np.sum(weights, axis=1, keepdims=True)
y_eval = np.dot(weights, y)
plt.plot(x, y, 'bo-')
plt.plot(x_eval, y_eval, 'ro-')
plt.show()
I'll preface this answer by saying that this is more of a DSP question than a programming question...
...that being said, there is a simple two-step solution to your problem.
Step 1: Resample the data
So to illustrate this we can create a random data set with unequal sampling:
import numpy as np
x = np.cumsum(np.random.randint(0,100,100))
y = np.random.normal(0,1,size=100)
This gives something like:
We can resample this data using simple linear interpolation:
nx = np.arange(x.max()) # choose new x axis sampling
ny = np.interp(nx,x,y) # generate y values for each x
This converts our data to:
Step 2: Apply filter
At this stage you can use some of the tools available through scipy to apply a Gaussian filter to the data with a given sigma value:
from scipy.ndimage import gaussian_filter1d  # the scipy.ndimage.filters namespace is deprecated
fx = gaussian_filter1d(ny, sigma=100)
Plotting this up against the original data we get:
The choice of the sigma value determines the width of the filter.
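For instance, to compare a few filter widths (reusing nx, ny, x, and y from above; sigma is measured in samples, and since nx has unit spacing here that equals sigma in x-units):
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter1d

for s in (30, 100, 300):
    plt.plot(nx, gaussian_filter1d(ny, sigma=s), label=f"sigma={s}")
plt.plot(x, y, 'k.', alpha=0.3, label='original data')
plt.legend()
plt.show()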
Based on #Jaime's answer I wrote a function that implements this with some additional documentation and the ability to discard estimates far from the datapoints.
I think confidence intervals could be obtained on this estimate by bootstrapping, but I haven't done this yet.
import numpy as np

def gaussian_sum_smooth(xdata, ydata, xeval, sigma, null_thresh=0.6):
    """Apply gaussian sum filter to data.

    xdata, ydata : array
        Arrays of x- and y-coordinates of data.
        Must be 1d and have the same length.
    xeval : array
        Array of x-coordinates at which to evaluate the smoothed result.
    sigma : float
        Standard deviation of the Gaussian to apply to each data point.
        Larger values yield a smoother curve.
    null_thresh : float
        For evaluation points far from data points, the estimate will be
        based on very little data. If the total weight is below this threshold,
        return np.nan at this location. Zero means always return an estimate.
        The default of 0.6 corresponds to approximately one sigma away
        from the nearest datapoint.
    """
    # Distance between every combination of xdata and xeval
    # each row corresponds to a value in xeval
    # each col corresponds to a value in xdata
    delta_x = xeval[:, None] - xdata
    # Calculate weight of every value in delta_x using Gaussian
    # Maximum weight is 1.0 where delta_x is 0
    weights = np.exp(-0.5 * ((delta_x / sigma) ** 2))
    # Multiply each weight by every data point, and sum over data points
    smoothed = np.dot(weights, ydata)
    # Nullify the result when the total weight is below threshold
    # This happens at evaluation points far from any data
    # 1-sigma away from a data point has a weight of ~0.6
    nan_mask = weights.sum(1) < null_thresh
    smoothed[nan_mask] = np.nan
    # Normalize by dividing by the total weight at each evaluation point
    # Nullification above avoids the divide-by-zero warning here
    smoothed = smoothed / weights.sum(1)
    return smoothed
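For example, applied to the toy data from the question (a quick usage sketch; sigma=0.1 is an arbitrary choice):
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.sort(np.random.rand(30))
y = np.random.rand(30)
x_eval = np.linspace(0, 1, 11)

y_smooth = gaussian_sum_smooth(x, y, x_eval, sigma=0.1)
plt.plot(x, y, 'o-', label='data')
plt.plot(x_eval, y_smooth, 'ro-', label='smoothed')
plt.legend()
plt.show()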
