Probability density function for a set of values using numpy

Probability density function for a set of values using numpy - python

Below is the data for which I want to plot the PDF.
https://gist.github.com/ecenm/cbbdcea724e199dc60fe4a38b7791eb8#file-64_general-out
Below is the script
import numpy as np
import matplotlib.pyplot as plt
import pylab
data = np.loadtxt('64_general.out')
H,X1 = np.histogram( data, bins = 10, normed = True, density = True) # Is this the right way to get the PDF ?
plt.xlabel('Latency')
plt.ylabel('PDF')
plt.title('PDF of latency values')
plt.plot(X1[1:], H)
plt.show()
When I plot the above, I get the following.
Is the above the correct way to calculate the PDF of a range of values
Is there any other way to confirm that the results I get is the actual PDF. For example, how can show the area under pdf = 1 for my case.

It is a legit way of approximating the PDF. Since np.histogram uses various techniques for binning the values you won't get the exact frequency of each number in your input. For a more exact approximation you should count the occurrence of each number and divide it by the total count. Also, since these are discrete values, the plot could be plotted as points or bars to give a more correct impression.
In the discrete case, the sum of the frequencies should equal 1. In the continuous case you can for example use np.trapz() to approximate the integral.

Related

Understanding the output of fftfreq function and the fft plot for a single row in an image

I am trying to understand the function fftfreq and the resulting plot generated by adding real and imaginary components for one row in the image. Here is what I did:
import numpy as np
import cv2
import matplotlib.pyplot as plt
image = cv2.imread("images/construction_150_200_background.png", 0)
image_fft = np.fft.fft(image)
real = image_fft.real
imag = image_fft.imag
real_row_bw = image_fft[np.ceil(image.shape[0]/2).astype(np.int),0:image.shape[1]]
imag_row_bw = image_fft[np.ceil(image.shape[0]/2).astype(np.int),0:image.shape[1]]
sum = real_row_bw + imag_row_bw
plt.plot(np.fft.fftfreq(image.shape[1]), sum)
plt.show()
Here is image of the plot generated :
I read the image from the disk, calculate the Fourier transform and extract the real and imaginary parts. Then I sum the sine and cosine components and plot using the pyplot library.
Could someone please help me understand the fftfreq function? Also what does the peak represent in the plot for the following image:
I understand that Fourier transform maps the image from spatial domain to the frequency domain but I cannot make much sense from the graph.
Note: I am unable to upload the images directly here, as at the moment of asking the question, I am getting an upload error.

I don't think that you really need fftfreq to look for frequency-domain information in images, but I'll try to explain it anyway.
fftfreq is used to calculate the frequencies that correspond to each bin in an FFT that you calculate. You are using fftfreq to define the x coordinates on your graph.
fftfreq has two arguments: one mandatory, one optional. The mandatory first argument is an integer, the window length you used to calculate an FFT. You will have the same number of frequency bins in the FFT as you had samples in the window. The optional second argument is the time period per window. If you don't specify it, the default is a period of 1. I don't know whether a sample rate is a meaningful quantity for an image, so I can understand you not specifying it. Maybe you want to give the period in pixels? It's up to you.
Your FFT's frequency bins start at the negative Nyquist frequency, which is half the sample rate (default = -0.5), or a little higher; and it ends at the positive Nyquist frequency (+0.5), or a little lower.
The fftfreq function returns the frequencies in a funny order though. The zero frequency is always the zeroth element. The frequencies count up to the maximum positive frequency, and then flip to the maximum negative frequency and count upwards towards zero. The reason for this strange ordering is that if you're doing FFT's with real-valued data (you are, image pixels do not have complex values), the negative frequency data is exactly equal to the corresponding positive frequency data and is redundant. This ordering makes it easy to throw the negative frequencies away: just take the first half of the array. Since you aren't doing that, you're plotting the negative frequencies too. If you should choose to ignore the second half of the array, the negative frequencies will be removed.
As for the strong spike that you see at the zero frequency in your image, this is probably because your image data is RGB values which range from 0 to 255. There's a huge "DC offset" in your data. It looks like you're using Matplotlib. If you are plotting in an interactive window, you can use the zoom rectangle to look at that horizontal line. If you push the DC offset off scale, setting the Y axis scale to perhaps ±500, I bet you will start to see that the horizontal line isn't exactly horizontal after all.
Once you know which bin contains your DC offset, if you don't want to see it, you can just assign the value of the fft in that bin to zero. Then the graph will scale automatically.
By the way, these two lines of code perform identical calculations, so you aren't actually taking the sine and cosine components like your text says:
real_row_bw = image_fft[np.ceil(image.shape[0]/2).astype(np.int),0:image.shape[1]]
imag_row_bw = image_fft[np.ceil(image.shape[0]/2).astype(np.int),0:image.shape[1]]
And one last thing: to sum the sine and cosine components properly (once you have them), since they're at right angles, you need to use a vector sum rather than a scalar sum. Look at the function numpy.linalg.norm.

Is there a way to plot Matplotlib's Imshow against a specific array rather than the indices?

I'm trying to use Imshow to plot a 2-d Fourier transform of my data. However, Imshow plots the data against its index in the array. I would like to plot the data against a set of arrays I have containing the corresponding frequency values (one array for each dim), but can't figure out how.
I have a 2D array of data (gaussian pulse signal) that I Fourier transform with np.fft.fft2. This all works fine. I then get the corresponding frequency bins for each dimension with np.fft.fftfreq(len(data))*sampling_rate. I can't figure out how to use imshow to plot the data against these frequencies though. The 1D equivalent of what I'm trying to do us using plt.plot(x,y) rather than just using plt.plot(y).
My first attempt was to use imshows "extent" flag, but as fas as I can tell that just changes the axis limits, not the actual bins.
My next solution was to use np.fft.fftshift to arrange the data in numerical order and then simply re-scale the axis using this answer: Change the axis scale of imshow. However, the index to frequency bin is not a pure scaling factor, there's typically a constant offset as well.
My attempt was to use 2d hist instead of imshow, but that doesn't work since 2dhist plots the number of times an order pair occurs, while I want to plot a scalar value corresponding to specific order pairs (i.e the power of the signal at specific frequency combinations).
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
f = 200
st = 2500
x = np.linspace(-1,1,2*st)
y = signal.gausspulse(x, fc=f, bw=0.05)
data = np.outer(np.ones(len(y)),y) # A simple example with constant y
Fdata = np.abs(np.fft.fft2(data))**2
freqx = np.fft.fftfreq(len(x))*st # What I want to plot my data against
freqy = np.fft.fftfreq(len(y))*st
plt.imshow(Fdata)
I should see a peak at (200,0) corresponding to the frequency of my signal (with some fall off around it corresponding to bandwidth), but instead my maximum occurs at some random position corresponding to the frequencie's index in my data array. If anyone has any idea, fixes, or other functions to use I would greatly appreciate it!

I cannot run your code, but I think you are looking for the extent= argument to imshow(). See the the page on origin and extent for more information.
Something like this may work?
plt.imshow(Fdata, extent=(freqx[0],freqx[-1],freqy[0],freqy[-1]))

How can I account for identical data points in a scatter plot?

I'm working with some data that has several identical data points. I would like to visualize the data in a scatter plot, but scatter plotting doesn't do a good job of showing the duplicates.
If I change the alpha value, then the identical data points become darker, which is nice, but not ideal.
Is there some way to map the color of a dot to how many times it occurs in the data set? What about size? How can I assign the size of the dot to how many times it occurs in the data set?

As it was pointed out, whether this makes sense depends a bit on your dataset. If you have reasonably discrete points and exact matches make sense, you can do something like this:
import numpy as np
import matplotlib.pyplot as plt
test_x=[2,3,4,1,2,4,2]
test_y=[1,2,1,3,1,1,1] # I am just generating some test x and y values. Use your data here
#Generate a list of unique points
points=list(set(zip(test_x,test_y)))
#Generate a list of point counts
count=[len([x for x,y in zip(test_x,test_y) if x==p[0] and y==p[1]]) for p in points]
#Now for the plotting:
plot_x=[i[0] for i in points]
plot_y=[i[1] for i in points]
count=np.array(count)
plt.scatter(plot_x,plot_y,c=count,s=100*count**0.5,cmap='Spectral_r')
plt.colorbar()
plt.show()
Notice: You will need to adjust the radius (the value 100 in th s argument) according to your point density. I also used the square root of the count to scale it so that the point area is proportional to the counts.
Also note: If you have very dense points, it might be more appropriate to use a different kind of plot. Histograms for example (I personally like hexbin for 2d data) are a decent alternative in these cases.

Using pyplot to draw histogram

I have a list.
Index of list is degree number.
Value is the probability of this degree number.
It looks like, x[ 1 ] = 0.01 means, the degree 1 's probability is 0.01.
I want to draw a distribution graph of this list, and I try
hist = plt.figure(1)
plt.hist(PrDeg, bins = 1)
plt.title("Degree Probability Histogram")
plt.xlabel("Degree")
plt.ylabel("Prob.")
hist.savefig("Prob_Hist")
PrDeg is the list which i mention above.
But the saved figure is not correct.
The X axis value becomes to Prob. and Y is Degree ( Index of list )
How can I exchange x and y axis value by using pyplot ?

Histograms do not usually show you probabilities, they show the count or frequency of observations within different intervals of values, called bins. pyplot defines interval or bins by splitting the range between the minimum and maximum value of your array into n equally sized bins, where n is the number you specified with argument : bins = 1. So, in this case your histogram has a single bin which gives it its odd aspect. By increasing that number you will be able to better see what actually happens there.
The only information that we can get from such an histogram is that the values of your data range from 0.0 to ~0.122 and that len(PrDeg) is close to 1800. If I am right about that much, it means your graph looks like what one would expect from an histogram and it is therefore not incorrect.
To answer your question about swapping the axes, the argument orientation=u'horizontal' is what you are looking for. I used it in the example below, renaming the axes accordingly:
import numpy as np
import matplotlib.pyplot as plt
PrDeg = np.random.normal(0,1,10000)
print PrDeg
hist = plt.figure(1)
plt.hist(PrDeg, bins = 100, orientation=u'horizontal')
plt.title("Degree Probability Histogram")
plt.xlabel("count")
plt.ylabel("Values randomly generated by numpy")
hist.savefig("Prob_Hist")
plt.show()

Manipulating the numpy.random.exponential distribution in Python

I am trying to create an array of random numbers using Numpy's random exponential distribution. I've got this working fine, however I have one extra requirement for my project and that is the ability to specify precisely how many array elements have a certain value.
Let me explain (code is below, but I'll have a go at explaining it here): I generate my random exponential distribution and plot a histogram of the data, producing a nice exponential curve. What I really want to be able to do is use a variable to specify the y-intercept of this curve (point where curve meets the y-axis). I can achieve this in a basic way by changing the number of bins in my histogram, but this only changes the plot and not the original data.
I have inserted the bones of my code here. To give some context, I am trying to create the exponential disc of a galaxy, hence the random array I want to generate is an array of radii and the variable I want to be able to specify is the number density in the centre of the galaxy:
import numpy as N
import matplotlib.pyplot as P
n = 1000
scale_radius = 2
central_surface_density = 100 #I would like this to be the controlling variable, even if it's specification had knock on effects on n.
radius_array = N.random.exponential(scale_radius,(n,1))
P.figure()
nbins = 100
number_density, radii = N.histogram(radius_array, bins=nbins,normed=False)
P.plot(radii[0:-1], number_density)
P.xlabel('$R$')
P.ylabel(r'$\Sigma$')
P.ylim(0, central_surface_density)
P.legend()
P.show()
This code creates the following histogram:
So, to summarise, I would like to be able to specify where this plot intercepts the y-axis by controlling how I've generated the data, not by changing how the histogram has been plotted.
Any help or requests for further clarification would be very much appreciated.

According to the docs for numpy.random.exponential, the input parameter beta, is 1/lambda for the definition of the exponential described in wikipedia.
What you want is this function evaluated at f(x=0)=lambda=1/beta. Therefore in a normed distribution, your y-intercept should just be the inverse of the numpy function:
import numpy as np
import pylab as plt
target = 250
beta = 1.0/target
Y = np.random.exponential(beta, 5000)
plt.hist(Y, normed=True, bins=200,lw=0,alpha=.8)
plt.plot([0,max(Y)],[target,target],'r--')
plt.ylim(0,target*1.1)
plt.show()
Yes the y-intercept of the histogram will change with different bin sizes, but this doesn't mean anything. The only thing that you can reasonably talk about here is the underlying probability distribution (hence the normed=true)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.