Mean of y value in vertical bin - python

So I have stock market data (date from 0 onwards, and a close price), and with this I use numpy.fft to calculate the fast Fourier transform and the corresponding frequencies, which I then have in the form of a zipped list, 'FFT, Frequency'. I have the frequency values separated into vertical logarithmic bins using:
logbins = np.logspace(min(logX), max(logX), num=numbins, base=10.0)
I then digitize the Frequency values into these bins and use:
for k in range(1,len(freqlogbins)):
    mean_freq.append(np.mean(Tfreq2[freqdig==k]))
This works fine; however, I also need to somehow work out the mean of the y values in each bin.
I imagine it is somehow possible by referring to the x values as ZippedList[i][0] and the y values as ZippedList[i][1], but I can't quite work out how.
Here is an example of copyable code:
import numpy as np
T_date0=np.arange(0,400)
T_price=np.random.uniform(100,400, size=400)
T_fft=np.fft.fft(T_price)
Tfreq=np.fft.fftfreq(T_date0.shape[-1])
I then remove any negative frequency values and the corresponding fft values using:
Tfreq2=[]
T_fft2=[]
for i in range(len(Tfreq)):
    if Tfreq[i]>0:
        Tfreq2.append(Tfreq[i])
        T_fft2.append(T_fft[i])
T_fft_absSq=(np.absolute(T_fft2))**2
logTFAS=np.log10(T_fft_absSq)
logTfreq=np.log10(Tfreq2)
numbins=10
logbins=np.logspace((min(logTfreq)-0.00000001),(max(logTfreq)+0.00000001),num=numbins, base=10.0) #The +/-0.00000001 are so that the endpoints lie in the bin intervals.
Tfreq2=np.array(Tfreq2)
TFAS=np.array(T_fft_absSq)
freqdig=np.digitize(Tfreq2,logbins)
mean_freq=[]
mean_fft=[]
for k in range(1,len(logbins)):
    mean_freq.append(np.mean(Tfreq2[freqdig==k]))
Fourier=list(zip(logTfreq,logTFAS))
##This is where I need to work out the mean of the y values, in the vertical bins
Here is what the data looks like, where the black dashed lines represent the bins and the dashed yellow lines represent the mean of the x values in each bin. The blue line is a 2nd order polynomial fit.
Obviously with random data it will look a little different to the link I posted below, but it gives an idea.

I was overthinking everything....
I was able to calculate the y value averages in a very similar way, using the frequency binning as such:
for k in range(1,len(logbins)):
    mean_freq.append(np.mean(np.array(logTfreq)[freqdig==k]))
    mean_fft.append(np.mean(np.array(logTFAS)[freqdig==k]))
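For reference, the approach above can be condensed into a self-contained sketch (the random data and variable names here are illustrative stand-ins, not the original stock data):

```python
import numpy as np

# Random stand-in data: x ~ log10(frequency), y ~ log10(|FFT|^2)
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, -0.3, 200)
y = rng.uniform(0.0, 10.0, 200)

numbins = 10
# Nudge the endpoints slightly so they fall inside the outer bins,
# mirroring the +/-0.00000001 trick above
edges = np.linspace(x.min() - 1e-8, x.max() + 1e-8, numbins)
dig = np.digitize(x, edges)

# The same bin assignment indexes both the x means and the y means
mean_x = [np.mean(x[dig == k]) for k in range(1, len(edges))]
mean_y = [np.mean(y[dig == k]) for k in range(1, len(edges))]
```

Linear bins in log space are equivalent to np.logspace bins on the raw frequencies, so the same digitize result can be reused for every column of the zipped data.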

Not quite sure what you're asking for, but maybe np.digitize will help:
import numpy as np
d = np.random.random(1000)
bins = np.linspace(0, 1, 10)
dig = np.digitize(d, bins)
binmean = [d[dig == i].mean() for i in range(1, len(bins))]
print(binmean)

Related

python signal.scipy.welch for complex input returns frequency indices which are not ordered negative to positive

My aim is to plot the PSD of a complex vector x.
I calculated the spectrum estimation using scipy.welch (version 1.4.1):
f, Px = scipy.signal.welch(x, return_onesided=False, detrend=False)
and then plotted:
plt.plot(f, 10*np.log10(Px),'.-')
plt.show()
The PSD was plotted fine, but I noticed an interpolation line from the last sample to the first on the plot. I then checked the frequency indices and noticed that they are ordered from DC(0) to half the sample rate(0.5 in this case) and then from -0.5 to almost zero. This is why the plot has a straight line across from Px(0.5) to Px(-0.5).
Why is the returned f vector (and the corresponding Px) not ordered from -0.5 to 0.5?
Can someone suggest a straightforward method? (I'm used to MATLAB, where it is much simpler to plot a PSD...)
Thanks
Think about the angle a complex value moves through between two samples: it can be anywhere from 0 to 360 degrees (this is the order that welch will return), but another way to see, say, 200 degrees is as 360-200 = 160 degrees in the other direction.
Since you are asking it to return the two-sided spectrum:
import numpy as np
import scipy.signal
import matplotlib.pyplot as plt
x = np.random.randn(1000)
x = x + np.roll(x, 1) + np.roll(x, -1) # rolling mean
f, p = scipy.signal.welch(x, detrend=False, return_onesided=False)
It will give you the positive-frequency spectrum followed by the negative-frequency spectrum. Numpy provides an fftshift function to rearrange an fft frequency vector. Check what the frequency vector (the x axis) looks like:
plt.plot(f)
plt.plot(np.fft.fftshift(f))
plt.legend(['welch return', 'fftshifted'])
So if you plot directly you will see the line connecting the last point of the positive frequency spectrum to the first point of the negative frequency spectrum
plt.plot(f, np.log(p))
If you reorder both f and p you see the expected result
plt.plot(np.fft.fftshift(f), np.fft.fftshift(p))
Note that for real data, welch will return the same values for the negative part and the positive part.
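That symmetry is easy to check numerically; a minimal sketch with a random test signal:

```python
import numpy as np
import scipy.signal

x = np.random.randn(1000)
f, p = scipy.signal.welch(x, detrend=False, return_onesided=False)

n = len(f)  # equals nperseg, 256 by default
# For real input the two-sided PSD is symmetric: the positive-frequency
# half (excluding DC) mirrors the negative-frequency half.
print(np.allclose(p[1:n//2], p[:n//2:-1]))  # -> True
```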

Number density distribution of an 1D-array - 2 different attempts

I have a large array of elements that I call RelDist (dimensionally, a unit of distance) in a simulated volume. I am attempting to determine the distribution of the "number of values per unit volume", i.e. the number density. It should be similar to this diagram:
Note that the axes are scaled log base 10; the plot of the set should definitely drop off.
Mathematically, I set it up as two equivalent equations:
n = (1/V) dN/d(ln r)
where N is the number of elements in the array, differentiated with respect to the natural log of the distances r, and V is the volume. It can also be equivalently re-written in the form of a regular derivative by introducing another factor of r:
n = (r/V) dN/dr
So for ever-increasing r, I want to count the change in N of elements per logarithmic bin of r.
As of now, I have trouble setting up the frequency counting in the histogram while accommodating the volume alongside it.
Attempt 1
This is using the dN/dlnr/volume equations
def n(dist, numbins):
    logdist = np.log(dist)
    hist, r_array = np.histogram(logdist, numbins)
    dlogR = r_array[1]-r_array[0]
    x_array = r_array[1:] - dlogR/2
    ## I am confident the above part of this code is correct.
    ## The succeeding portion does not work.
    dR = r_array[1:] - r_array[0:numbins]
    dN_dlogR = hist * x_array/dR
    volume = 4*np.pi*dist*dist*dist
    ## The included volume is incorrect
    return [x_array, dN_dlogR/volume]
Plotting this does not even properly show a distribution like the first plot I posted above, and it only works when I choose the bin number to match the shape of my input array. The bin number should be arbitrary, should it not?
Attempt 2
This is using the equivalent dN/dr/volume equation.
numbins = np.linspace(min(RelDist),max(RelDist), 100)
hist, r_array = np.histogram(RelDist, numbins)
volume = 4*float(1000**2)
dR = r_array[1]-r_array[0]
x_array = r_array[1:] - dR/2
y = hist/dR
A little bit easier, but without including the volume term, I get a sort of histogram distribution, which is at least a start.
With this attempt, how would I include the volume term with the array?
Example
Start at a distance value R of something like 10, count the change in number with respect to R, then increase to a distance value R of 20, count the change, increase to a value of 30, count the change, and so on and so forth.
Here is a txt file of my array if you are interested in re-creating it
https://www.dropbox.com/s/g40gp88k2p6pp6y/RelDist.txt?dl=0
Since no one was able to help answer, I will provide my result in case someone wants to use it in the future:
def n_ln(dist, numbins):
    log_dist = np.log10(dist)
    bins = np.linspace(min(log_dist),max(log_dist), numbins)
    hist, r_array = np.histogram(log_dist, bins)
    dR = r_array[1]-r_array[0]
    x_array = r_array[1:] - dR/2
    volume = [4.*np.pi*i**3. for i in 10**x_array[:]]
    return [10**x_array, hist/dR/volume]
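To exercise the function, here is a self-contained sketch repeating it with synthetic distances (the uniform draw is just a stand-in for the real RelDist array):

```python
import numpy as np

def n_ln(dist, numbins):
    log_dist = np.log10(dist)
    bins = np.linspace(min(log_dist), max(log_dist), numbins)
    hist, r_array = np.histogram(log_dist, bins)
    dR = r_array[1] - r_array[0]
    x_array = r_array[1:] - dR / 2              # bin centres in log space
    volume = [4. * np.pi * i**3. for i in 10**x_array]
    return [10**x_array, hist / dR / volume]

rng = np.random.default_rng(1)
dist = rng.uniform(1.0, 1000.0, 5000)           # synthetic stand-in for RelDist
r, n_density = n_ln(dist, 50)                   # 49 bin centres and densities
```

With numbins edges, np.histogram returns numbins-1 counts, so r and n_density each have 49 entries here; n_density falls off with r because the volume term grows as r cubed.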

Using pyplot to draw histogram

I have a list whose index is the degree number and whose value is the probability of that degree.
For example, x[1] = 0.01 means the probability of degree 1 is 0.01.
I want to draw a distribution graph of this list, and I try
hist = plt.figure(1)
plt.hist(PrDeg, bins = 1)
plt.title("Degree Probability Histogram")
plt.xlabel("Degree")
plt.ylabel("Prob.")
hist.savefig("Prob_Hist")
PrDeg is the list I mentioned above.
But the saved figure is not correct.
The x-axis value becomes Prob. and y becomes Degree (the index of the list).
How can I exchange the x and y axis values using pyplot?
Histograms do not usually show you probabilities; they show the count or frequency of observations within different intervals of values, called bins. pyplot defines the intervals, or bins, by splitting the range between the minimum and maximum values of your array into n equally sized bins, where n is the number you specified with the argument bins=1. So in this case your histogram has a single bin, which gives it its odd aspect. By increasing that number you will be able to see better what actually happens there.
The only information we can get from such a histogram is that the values of your data range from 0.0 to ~0.122 and that len(PrDeg) is close to 1800. If I am right about that much, your graph looks like what one would expect from a histogram, and it is therefore not incorrect.
To answer your question about swapping the axes, the argument orientation=u'horizontal' is what you are looking for. I used it in the example below, renaming the axes accordingly:
import numpy as np
import matplotlib.pyplot as plt
PrDeg = np.random.normal(0,1,10000)
print(PrDeg)
hist = plt.figure(1)
plt.hist(PrDeg, bins = 100, orientation=u'horizontal')
plt.title("Degree Probability Histogram")
plt.xlabel("count")
plt.ylabel("Values randomly generated by numpy")
hist.savefig("Prob_Hist")
plt.show()

How to draw a histogram when some bins dominate the others

I would like to draw a histogram that explains how the data is distributed. My problem is that most of the data have very small values. Hence, if you use 10 bins, it won't be very descriptive; most of the data squeeze into the 0.0-0.1 bin. If you use 1000 bins, the histogram does not look good because of the xlabels, and some bins overlap others since there are too many bins.
I tried using a log scale and a normalized version as well, but I still couldn't get an informative histogram. I have already calculated the (1000) bins and the counts. The code for reading the data is below. You can run it: ./sub-histogram.py hist-data.txt 2500 0. 0 means you use the raw counts (first line). The last line contains the bin values.
My first idea is to merge counts and bins with some threshold: if a count is smaller than the threshold, accumulate it and skip that bin. I don't have any further ideas right now, but I am sure that if you use histograms you've come across this issue. Is there any solution for such cases? The data and everything are here.
import sys
import matplotlib.pyplot as plt
import numpy as np
lines = open(sys.argv[1]).readlines()
threshold = float(sys.argv[2])
count_type = int(sys.argv[3])  # 0 for raw counts, 1 for normalized counts, 2 for log counts
# reading
C = list(map(float, lines[count_type][1:-2].replace(",", "").split()))
B = list(map(float, lines[3][1:-2].replace(",", "").split()))
# merging method:
# accumulate the counts with respect to the threshold.
counts = []
bins = []
ct = 0
for c, b in zip(C, B):
    ct += c
    if ct >= threshold:
        counts.append(ct)
        bins.append(b)
        ct = 0
if ct > 0:
    counts.append(ct)
    bins.append(b)
    ct = 0
print(counts)
print(bins)
bar_width = 0.005
plt.xticks(np.linspace(0,2,41))
plt.bar(bins, counts, bar_width)
plt.show()
I would suggest having a number of bins for your small values and one bigger catch-all bin: e.g. 100 bins for values in the range 0.000 to 0.200 with an interval of 0.002, and one bin for everything over 0.200 (you could also have ten bins for 0.000-0.009, ten for 0.010-0.090, etc.). You will then need to override the labels on the x-axis, but ax.set_xticklabels lets you do that.
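A sketch of that suggestion (the data, the 0.002 bin width, and the 0.200 cut-off are all illustrative assumptions): draw the bars at equal width so the catch-all bin doesn't dwarf the plot, then relabel the ticks with the true bin edges via set_xticklabels.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = np.concatenate([rng.exponential(0.02, 10000),   # many small values
                       rng.uniform(0.2, 2.0, 200)])    # a long sparse tail

# Fine bins up to 0.200, plus one catch-all bin for everything above
edges = np.concatenate([np.arange(0.0, 0.202, 0.002), [data.max()]])
counts, _ = np.histogram(data, bins=edges)

fig, ax = plt.subplots()
ax.bar(range(len(counts)), counts, width=1.0)  # equal-width bars per bin
ticks = list(range(0, len(counts) + 1, 20))
ax.set_xticks(ticks)
ax.set_xticklabels([f"{edges[t]:.3f}" for t in ticks])  # true edge values
fig.savefig("merged-bins.png")
```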

How to best utilize the hist() to show a cumulative and normed histogram?

I have a problem while dealing with a data set whose values range from 0 to tens of thousands. There is no problem showing the histogram of the whole data set using hist(). However, if I only want to show a detailed cumulative and normed histogram over, say, x = [0, 120], I have to use 600000 bins to preserve the detail.
The tricky problem is that if I just use the range (0, 120) to show a normed and cumulative hist, it will end at 1. But actually it is far less than the real '1', since it is normed only within this small range of the data. Does anyone have ideas on how to use hist() in matplotlib to tackle this problem? I thought this should not be so complicated that I have to write another function to draw the hist I need.
You can set bins to a list, not an integer, e.g., bins=[1,2,3,..,120,30000,60000].
To answer your comment below, here is an excerpt from the documentation:
bins:
Either an integer number of bins or a sequence giving the bins. If bins is an integer, bins + 1 bin edges will be returned, consistent with numpy.histogram() for numpy version >= 1.3, and with the new = True argument in earlier versions. Unequally spaced bins are supported if bins is a sequence.
And here is an example with cumulative normalized histogram. Notice the effect of bins = [100,125,150,160,170,180,190,200,210,220,230,240,250,275,300] on this bar plot, how the first two bars are wider than the middle bars.
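A sketch of what that example looks like (the normal data here is a synthetic assumption; the bin list is the one quoted above). Note that with cumulative=True and density=True the last bin is normalized to 1 even though data outside [100, 300] is excluded, which is exactly the normalization caveat the question raises.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(200, 40, 10000)  # synthetic stand-in

# Unequally spaced bins: wide at the edges, fine in the middle
bins = [100, 125, 150, 160, 170, 180, 190, 200,
        210, 220, 230, 240, 250, 275, 300]
counts, edges, _ = plt.hist(data, bins=bins, cumulative=True, density=True,
                            edgecolor="black")
plt.savefig("cumulative-hist.png")
```

The bars at 100-125 and 125-150 come out visibly wider than the middle ones, matching the description above.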
Hmmm, I guess this is related to your previous question (Memory error when dealing with huge data). My suggestion there doesn't seem to work for a cumulative histogram.
I can't get plt.hist() to play nice with cyborg's suggestion, so I did the cumsum and normalisation by hand:
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import normal
inp = np.abs(normal(0, 100000, 100000))
bins = range(0, 120)
a,b = np.histogram(inp, bins = bins)
bar_edges = b[:-1]
bar_width = b[1] - b[0]
bar_height = (np.cumsum(a) + sum(inp<min(bins))) / len(inp)
plt.figure(1)
plt.bar(bar_edges, bar_height, width = bar_width)
plt.show()
