I want to create a uint16 image of gaussian noise with a defined mean and standard deviation.
I've tried using numpy's random.normal for this but it returns a float64 array:
mu = 10
sigma = 100
shape = (1024,1024)
gauss_img = np.random.normal(mu, sigma, shape)
print(gauss_img.dtype)
>>> dtype('float64')
Is there a way to convert gauss_img to a uint16 array while preserving the original mean and standard deviation? Or is there another approach entirely to creating a uint16 noise image?
EDIT: As was mentioned in the comments, np.random.normal will inevitably sample negative values when the standard deviation is larger than the mean, which is a problem for converting to uint16.
So I think I need a different method that will create an unsigned gaussian image directly.
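A quick sketch (my own check, using the mu and sigma from the question) of how often that happens:
import numpy as np
mu, sigma = 10, 100
samples = np.random.normal(mu, sigma, 10**6)
print((samples < 0).mean())  # roughly 0.46: nearly half the draws are negative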
So I think this is close to what you're looking for.
Import libraries and spoof some skewed data. Here, since the input is of unknown origin, I created skewed data using np.expm1(np.random.normal()). You could use skewnorm().rvs() as well, but that's kind of cheating since that's also the lib you'll use to characterize it.
I flatten the raw samples to make plotting histograms easier.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skewnorm
# generate dummy raw starting data
# smaller shape just for simplicity
shape = (100, 100)
raw_skewed = np.maximum(0.0, np.expm1(np.random.normal(2, 0.75, shape))).astype('uint16')
# flatten to look at histograms and compare distributions
raw_skewed = raw_skewed.reshape((-1))
Now find the params that characterize your skewed data, and use those to create a new distribution to sample from that hopefully matches your original data well.
These two lines of code are really what you're after I think.
# find params
a, loc, scale = skewnorm.fit(raw_skewed)
# mimick orig distribution with skewnorm
new_samples = skewnorm(a, loc, scale).rvs(10000).astype('uint16')
Now plot the distributions of each to compare.
plt.hist(raw_skewed, bins=np.linspace(0, 60, 30), hatch='\\', label='raw skewed')
plt.hist(new_samples, bins=np.linspace(0, 60, 30), alpha=0.65, color='green', label='mimic skewed dist')
plt.legend()
The histograms are pretty close. If that looks good enough, reshape your new data to the desired shape.
# final result
new_samples.reshape(shape)
Now... here's where I think it probably falls short. Take a look at the heatmap of each. The original distribution had a longer tail to the right (more outliers that skewnorm() didn't characterize).
This plots a heatmap of each.
# plot heatmaps of each
fig = plt.figure(2, figsize=(18,9))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
im1 = ax1.imshow(raw_skewed.reshape(shape), vmin=0, vmax=120)
ax1.set_title("raw data - mean: {:3.2f}, std dev: {:3.2f}".format(np.mean(raw_skewed), np.std(raw_skewed)), fontsize=20)
im2 = ax2.imshow(new_samples.reshape(shape), vmin=0, vmax=120)
ax2.set_title("mimicked data - mean: {:3.2f}, std dev: {:3.2f}".format(np.mean(new_samples), np.std(new_samples)), fontsize=20)
plt.tight_layout()
# add colorbar
fig.subplots_adjust(right=0.85)
cbar_ax = fig.add_axes([0.88, 0.1, 0.08, 0.8]) # [left, bottom, width, height]
fig.colorbar(im1, cax=cbar_ax)
Looking at it... you can see occasional flecks of yellow indicating very high values in the original distribution that didn't make it into the output. This also shows up in the higher std dev of the input data (see the titles of each heatmap; as noted in the comments on the original question, mean and std don't really characterize these distributions since they're not normal, but they serve as a relative comparison).
But... that's just the problem it has with the very specific skewed sample i created to get started. There's hopefully enough here to mess around with and tune until it suits your needs and your specific dataset. Good luck!
With that mean and sigma you are bound to sample some negative values. So I guess one option is to find the most negative value after sampling and add its absolute value to all the samples, then convert to uint16 as suggested in the comments. But of course you lose the original mean this way.
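A minimal sketch of that shift-and-convert idea (my own illustration, reusing the question's mu, sigma and shape):
import numpy as np
mu, sigma, shape = 10, 100, (1024, 1024)
gauss_img = np.random.normal(mu, sigma, shape)
shifted = gauss_img - gauss_img.min()    # add |most negative value| to everything
gauss_u16 = shifted.astype(np.uint16)    # std is roughly preserved, but the mean is shifted
print(gauss_u16.mean(), gauss_u16.std())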
If you have a range of uint16 numbers to sample from, then you should check out this post.
This way you could use scipy.stats.truncnorm to generate a gaussian of unsigned integers.
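A rough sketch of that (not from the linked post), truncating at 0 and at the top of the uint16 range:
import numpy as np
from scipy.stats import truncnorm
mu, sigma, shape = 10, 100, (1024, 1024)
# a, b are the truncation bounds in units of standard deviations from the mean
a, b = (0 - mu) / sigma, (65535 - mu) / sigma
samples = truncnorm(a, b, loc=mu, scale=sigma).rvs(np.prod(shape))
img = samples.reshape(shape).astype(np.uint16)
# note: truncating at 0 raises the mean and shrinks the std relative to mu and sigma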
Related
I have a simulated signal which is displayed as a histogram. I want to emulate the real measured signal by convolving it with a Gaussian of a specific width, since in the real experiment a detector has a certain uncertainty in the measured channels.
I have tried to do the convolution using np.convolve as well as scipy.signal.convolve, but I can't seem to get the filtering right. Not only is the resulting shape off (it should be a slightly smeared version of the histogram), the x-axis (e.g. the energy scale) is off as well.
I tried defining my Gaussian with a width of 20 keV as:
gauss = np.random.normal(0, 20000, len(coincidence['esum']))
hist_gauss = plt.hist(gauss, bins=100)[0]
where len(coincidence['esum']) is the length of my coincidence dataframe column. I bin this column using:
counts = plt.hist(coincidence['esum'], bins=100)[0]
Besides this approach to generating a suitable Gaussian, I tried scipy.signal.gaussian(50, 30000), which unfortunately produces a parabolic-looking curve and does not exhibit the characteristic tails.
I tried doing the convolution using both coincidence['esum'] and counts with both Gaussian approaches. Note that a simple convolution with the standard example from Finding the convolution of two histograms works without problems.
Would anyone know how to do such a convolution in Python? I exported the column of coincidence['esum'] that I use for my histogram to a pastebin, in case anyone is interested and wants to recreate it with the specific data: https://pastebin.com/WFiSBFa6
As you may be aware, the convolution of two histograms with the same bin size gives the histogram of the sums obtained by adding each element of one sample to each element of the other sample.
I cannot see exactly what you are doing. One important thing you don't seem to be doing is making sure the bins of the two histograms have the same width, and you also have to take care of the positions of the edges of the second histogram's bins.
In code we have
import numpy as np
import matplotlib.pyplot as plt

def hist_of_addition(A, B, bins=10, plot=False):
    A_heights, A_edges = np.histogram(A, bins=bins)
    # make sure the histogram is equally spaced
    assert(np.allclose(np.diff(A_edges), A_edges[1] - A_edges[0]))
    # make sure to use the same interval
    step = A_edges[1] - A_edges[0]
    # specify parameters to make sure the histogram of B will
    # have the same bin size as the histogram of A
    nBbin = int(np.ceil((np.max(B) - np.min(B))/step))
    left = np.min(B)
    B_heights, B_edges = np.histogram(B, range=(left, left + step * nBbin), bins=nBbin)
    # check that the bins for the second histogram match the first
    assert(np.allclose(np.diff(B_edges), step))
    C_heights = np.convolve(A_heights, B_heights)
    C_edges = B_edges[0] + A_edges[0] + np.arange(0, len(C_heights) + 1) * step
    if plot:
        plt.figure(figsize=(12, 4))
        plt.subplot(131)
        plt.bar(A_edges[:-1], A_heights, step)
        plt.title('A')
        plt.subplot(132)
        plt.bar(B_edges[:-1], B_heights, step)
        plt.title('B')
        plt.subplot(133)
        plt.bar(C_edges[:-1], C_heights, step)
        plt.title('A+B')
    return C_edges, C_heights
Then
A = -np.cos(np.random.rand(10**6))
B = np.random.normal(1.5, 0.025, 10**5)
hist_of_addition(A, B, bins=100, plot=True);
Gives
I am trying to find the peaks in some very noisy data such as this:
Without understanding the terminology very well, I'm defining the peaks as narrow (width<30) and more than 100000 higher than the nearby area.
I'm trying to use scipy's find_peaks_cwt but the documentation is pretty unclear to me. I tried find_peaks_cwt(my_data, np.arange(1,30)) but it returned a huge number of peaks. Then I tried adding the noise_perc=60 argument but this didn't really fix the problem. I've also tried playing around with the other parameters but I don't really understand what a 'ridge line' is.
What should I be doing differently? Is widths=np.arange(1,30) setting my width requirement like I think it is? How do I specify a height requirement?
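As an aside, scipy.signal.find_peaks (the non-wavelet counterpart to find_peaks_cwt) does accept explicit width and prominence constraints; a rough sketch, assuming my_data is the noisy array from the question:
from scipy.signal import find_peaks
# width in samples; prominence expresses "more than 100000 higher than the nearby area"
peaks, props = find_peaks(my_data, width=(None, 30), prominence=100000)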
A lot depends on what your data actually mean (or what you think they ought to mean). Here's an example with synthetic data:
from scipy.signal import find_peaks_cwt
from matplotlib.pyplot import plot, ylim
from numpy import *
N = 2000
x = arange(N)
pwid = 200.
zideal = sinc(x/pwid - 2)**2 # Vaguely similar to yours
z = zideal * random.randn(N)**2 # adding noise
plot(x, zideal, lw=4)
ylim(0, 1)
zf = find_peaks_cwt(z, pwid/4+zeros(N))
plot(x[zf], zideal[zf], '*', ms=20, color='green')
# Create averaging zones around peaks
xlow = maximum(array(zf) - pwid/2, 0).astype(int)
xhigh = minimum(array(zf) + pwid/2, x.max()).astype(int)
zguess = zeros(len(zf))  # allocate space
for ii in range(len(zf)):
    zguess[ii] = z[xlow[ii]:xhigh[ii]].mean()
plot(x[zf], zguess, 'o', ms=10, color='red')
pwid scales the width of the peaks in the sinc() function. In the call to find_peaks_cwt(), using larger values for widths produces fewer peaks (lower density of peaks). The best result seems to come from scaling the values in widths to approximately the half-width at half-max (HWHM) of the peaks.
find_peaks_cwt() does a pretty respectable job of finding the peaks from the ideal data. Summing around the values is a way to guess at the peak values. If you're summing spectral power, you should probably sum all the values in between rather than a fixed interval like I did for this quick-and-dirty demo.
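A rough sketch of what "sum all the values in between" could look like, reusing z, zf and N from above and splitting the axis at the midpoints between neighbouring peaks (the midpoint choice is mine):
# boundaries halfway between adjacent detected peaks, plus the array ends
bounds = concatenate(([0], (array(zf[:-1]) + array(zf[1:])) // 2, [N])).astype(int)
peak_sums = [z[bounds[ii]:bounds[ii+1]].sum() for ii in range(len(zf))]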
I find the function especially impressive in the way it finds smaller peaks in the presence of much larger ones.
My data (a 196,585-record numpy array extracted from a pandas dataframe) are being placed into a single bin by matplotlib.hist. The data were originally integers, so I tried converting them to float as well, as shown below, but they are still not being distributed among 10 bins.
Interestingly, a small sub-sample (using df.sample(0.00x)) of the integer data are successfully distributed.
Any suggestions on where I may be erring in data preparation or use of matplotlib's histogram function would be appreciated.
x = df[(df['UNIT']=='X')].OPP_VALUE.values
num_bins = 10
n, bins, patches = plt.hist((x[(x>0)]).astype(float), num_bins, normed=False, facecolor='0.5', alpha=0.8)
plt.show()
Most likely what is happening is that the number of data points with x > 0.5 is very small, but you do have some outliers that force the hist function to pick the scale it does. Try removing all values > 0.5 (or 1 if you do not want to convert to float) and then plot again.
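A minimal sketch of that suggestion, assuming x is the array from the question (the 0.5 cutoff is just the placeholder value from above, so pick whatever threshold fits your data):
import matplotlib.pyplot as plt
trimmed = x[(x > 0) & (x <= 0.5)]  # drop the outliers before plotting
plt.hist(trimmed.astype(float), 10, facecolor='0.5', alpha=0.8)
plt.show()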
You should modify the number of bins, for example:
number_of_bins = 200
bin_cutoffs = np.linspace(np.percentile(x,0), np.percentile(x,99),number_of_bins)
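Presumably (this line is my addition, not part of the original answer) those cutoffs would then be passed to the histogram call:
n, bins, patches = plt.hist(x[x > 0].astype(float), bins=bin_cutoffs, facecolor='0.5', alpha=0.8)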
I'm trying to get a nice upsampler using Python when I have non-uniform spaced inputs. Any suggestions would be helpful. I've tried a number of interp functions. Here's an example:
from scipy.interpolate import InterpolatedUnivariateSpline
from numpy import linspace, arange, append
from matplotlib.pyplot import plot, show
F = [0, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 22050]
M = [0., 2.85, 2.49, 1.65, 1.55, 1.81, 1.35, 1.00, 1.13, 1.58, 1.21, 0.]
ff = linspace(F[0], F[1], 10)
for i in arange(2, len(F)):
    ff = append(ff, linspace(F[i-1], F[i], 10))
aa = InterpolatedUnivariateSpline(x=F, y=M, k=2)
mm = aa(ff)
plot(F, M, 'r-o'); plot(ff, mm, 'bo'); show()
This is the plot I get:
I need to get interpolated values that don't go below 0. Note that the blue dots go below zero. The red line represents the original F vs. M data. If I use k=1 (piece-wise linear interp) then I get good values as shown here:
aa=InterpolatedUnivariateSpline(x=F,y=M,k=1)
mm=aa(ff); plot(F,M,'r-o');plot(ff,mm,'bo'); show()
The problem is that I need a "smooth" interpolation, not the piece-wise one. Does anyone know if the bbox argument in InterpolatedUnivariateSpline helps to fix that? I can't find any documentation on what bbox does. Is there another, easier way to accomplish this?
Thanks in advance for any help.
Positivity-preserving interpolation is hard (if it wasn't, there wouldn't be a bunch of papers written about it). The splines of low degree (2, 3) usually do pretty well in this regard, but your data has that large gap in it, and it happens to be at the end of data range, making things worse.
One solution is to do interpolation in two steps: first upsample the data by piecewise linear interpolation, then interpolate new data with a smooth spline (I'll use cubic spline below, though quadratic also works).
The gap_size array records how large each gap is, relative to the smallest one. In the subsequent loop, uniformly spaced points are inserted into the large gaps (those that are at least twice the size of the smallest one). The result is F_new, a nearly uniform, finer grid that still includes the original points. The corresponding M values for it are generated by a piecewise linear spline.
Subsequent cubic interpolation produces a smooth curve that stays positive.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import InterpolatedUnivariateSpline

F = [0, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 22050]
M = [0., 2.85, 2.49, 1.65, 1.55, 1.81, 1.35, 1.00, 1.13, 1.58, 1.21, 0.]
# how large each gap is, relative to the smallest one
gap_size = np.diff(F) // np.diff(F).min()
F_new = []
for i in range(len(F)-1):
    F_new.extend(np.linspace(F[i], F[i+1], gap_size[i], endpoint=False))
F_new.append(F[-1])
# piecewise linear upsampling onto the finer grid, then a smooth cubic spline
pl_spline = InterpolatedUnivariateSpline(F, M, k=1)
M_new = pl_spline(F_new)
smooth_spline = InterpolatedUnivariateSpline(F_new, M_new, k=3)
ff = np.linspace(F[0], F[-1], 100)
plt.plot(F, M, 'ro')
plt.plot(ff, smooth_spline(ff), 'b')
plt.show()
Of course, no tricks can hide the fact that we don't know what happens between 5500 and 22050 (Hz, I presume); the nearly linear part there is just a placeholder.
I am trying to create an array of random numbers using Numpy's random exponential distribution. I've got this working fine, however I have one extra requirement for my project and that is the ability to specify precisely how many array elements have a certain value.
Let me explain (code is below, but I'll have a go at explaining it here): I generate my random exponential distribution and plot a histogram of the data, producing a nice exponential curve. What I really want to be able to do is use a variable to specify the y-intercept of this curve (point where curve meets the y-axis). I can achieve this in a basic way by changing the number of bins in my histogram, but this only changes the plot and not the original data.
I have inserted the bones of my code here. To give some context, I am trying to create the exponential disc of a galaxy, hence the random array I want to generate is an array of radii and the variable I want to be able to specify is the number density in the centre of the galaxy:
import numpy as N
import matplotlib.pyplot as P
n = 1000
scale_radius = 2
central_surface_density = 100 # I would like this to be the controlling variable, even if its specification had knock-on effects on n.
radius_array = N.random.exponential(scale_radius,(n,1))
P.figure()
nbins = 100
number_density, radii = N.histogram(radius_array, bins=nbins,normed=False)
P.plot(radii[0:-1], number_density)
P.xlabel('$R$')
P.ylabel(r'$\Sigma$')
P.ylim(0, central_surface_density)
P.legend()
P.show()
This code creates the following histogram:
So, to summarise, I would like to be able to specify where this plot intercepts the y-axis by controlling how I've generated the data, not by changing how the histogram has been plotted.
Any help or requests for further clarification would be very much appreciated.
According to the docs for numpy.random.exponential, the input parameter beta is 1/lambda in the definition of the exponential distribution described on Wikipedia.
What you want is that density evaluated at zero: f(x) = lambda * exp(-lambda * x), so f(0) = lambda = 1/beta. Therefore, for a normed distribution, the y-intercept is just the inverse of the scale parameter you pass to the numpy function:
import numpy as np
import pylab as plt
target = 250
beta = 1.0/target
Y = np.random.exponential(beta, 5000)
plt.hist(Y, density=True, bins=200, lw=0, alpha=.8)
plt.plot([0,max(Y)],[target,target],'r--')
plt.ylim(0,target*1.1)
plt.show()
Yes, the y-intercept of the histogram will change with different bin sizes, but that doesn't mean anything by itself. The only thing you can reasonably talk about here is the underlying probability density (hence the density=True).
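As a rough check of that claim (my own addition, reusing Y and target from the snippet above), the height of the first bin of a density-normalized histogram lands near target regardless of how many bins you use, up to sampling noise and bin-width averaging:
heights, edges = np.histogram(Y, bins=50, density=True)
print(heights[0])   # close to target (250)
heights, edges = np.histogram(Y, bins=500, density=True)
print(heights[0])   # still close to target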