How to draw a histogram when some bins dominate the others - python

I would like to draw a histogram that shows how the data are distributed. My problem is that most of the data have very small values, so with 10 bins the plot is not very descriptive: almost everything is squeezed into the 0.0-0.1 bin. With 1000 bins the histogram does not look good either, because the x labels and some of the bars overlap; there are too many bins. I tried a log scale and a normalized version as well, but I still couldn't get an informative histogram. I have already calculated the (1000) bins and the counts. The code for reading the data is below; you can run it with ./sub-histogram.py hist-data.txt 2500 0, where 0 means you use the raw counts (first line of the file). The last line of the file contains the bin values.
My first idea is to merge counts and bins with some threshold: if a count is smaller than the threshold, accumulate it and skip that bin. I don't have any further ideas right now, but I am sure that anyone who uses histograms has come across this issue. Is there any solution for such cases? Data and everything is here.
import sys
from itertools import izip
import matplotlib.pyplot as plt
import numpy as np

lines = open(sys.argv[1]).readlines()
threshold = float(sys.argv[2])
count_type = int(sys.argv[3])  # 0 for raw counts, 1 for normalized counts, 2 for log counts

# reading: strip the surrounding brackets and commas, then parse the floats
C = map(float, lines[count_type][1:-2].replace(",", "").split())
B = map(float, lines[3][1:-2].replace(",", "").split())

# merging method: accumulate counts until the running total reaches the
# threshold, then emit one merged bin
counts = []
bins = []
ct = 0
for c, b in izip(C, B):
    ct += c
    if ct >= threshold:
        counts.append(ct)
        bins.append(b)
        ct = 0
if ct > 0:  # flush whatever is left into a final bin
    counts.append(ct)
    bins.append(b)
    ct = 0

print counts
print bins

bar_width = 0.005
plt.xticks(np.linspace(0, 2, 41))
plt.bar(bins, counts, bar_width)
plt.show()

I would suggest having a number of bins for your small values plus one bigger catch-all bin, e.g. 100 bins for values in the range 0.000 to 0.200 with an interval of 0.002 and a single bin for everything over 0.200 (you could also use ten bins for 0.000-0.009, ten for 0.010-0.090, etc.). You will then need to override the labels on the x-axis, but ax.set_xticklabels lets you do that; see the sketch below.
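A sketch of that idea with invented data (the data, tick spacing, and edge values here are made up for illustration): give every bin the same visual width, so the catch-all bin does not dominate the picture, and relabel the ticks by hand.
import numpy as np
import matplotlib.pyplot as plt

# invented data: mostly tiny values plus a sparse tail
data = np.concatenate([np.random.exponential(0.05, 10000),
                       np.random.uniform(0.2, 2.0, 100)])

# 100 bins of width 0.002 over [0, 0.2], plus one catch-all bin up to the max
edges = np.append(np.linspace(0.0, 0.2, 101), data.max())
counts, _ = np.histogram(data, bins=edges)

# draw every bin with the same visual width, then relabel the ticks
fig, ax = plt.subplots()
ax.bar(np.arange(len(counts)), counts, width=1.0)
ticks = np.arange(0, len(counts) + 1, 20)
ax.set_xticks(ticks)
ax.set_xticklabels(['%g' % edges[t] for t in ticks[:-1]] + ['>0.2'])
plt.show()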

bin counts in stacked histogram (weighted) with x-coordinate greater than certain value

Let's say I have two datasets, and I plot a stacked histogram of both with some weights. Can I find out the total bin count for data elements greater than a certain number, i.e. for x-coordinates greater than a certain value? To illustrate my question, I have done the following:
import matplotlib.pyplot as plt
import numpy as np
data1 = np.random.normal(0,0.6,1000)
data2 = np.random.normal(0,1.4,1000)
weight1 = np.array([0.5]*len(data1))
weight2 = np.array([0.9]*len(data2))
hist = plt.hist((data1,data2),weights=(weight1,weight2),stacked=True,range=(-5,5))
plt.show()
Now, how would I know the bin counts, say, for x greater than -2?
As of now, to get that answer, I am doing the following:
n1,_,_ = plt.hist((data1,data2),weights=(weight1,weight2),stacked=False,range=(-2,10000))
bin_counts=sum(sum(n1))
print(bin_counts)
Here, I choose the max value in range to be some absurdly large number, so that I get all the bin counts for x = -2 and greater.
Is there any more efficient way than this?
Also, what would be the way to obtain the bin counts for a variable x, where x runs from the minimum to the maximum of the x-coordinates in some steps?
Any help will be greatly appreciated!
Thanks much!
You could do the following:
# in your case n is going to be a list of arrays, because you have 2 histograms
n, bins, _ = plt.hist(...)
# get a list of lists of counts for bin values over x
# (zip truncates at the shorter sequence, so each count pairs with its left bin edge)
n_over_x = [[val for val, edge in zip(selected_cnt, bins) if edge > x] for selected_cnt in n]
# sum up the list of lists
result = sum([sum(part_list) for part_list in n_over_x])
And here's what I came up with:
def my_range(start, end, step):
    while start <= end:
        yield start
        start += step

b_counts = [0]*len(data1)  # b_counts holds the normalized events (normalized according to the weights)
value = [0]*len(data1)
bin_min = -5
bin_max = 10
bin_step = 1
count_max = (bin_max - bin_min)/bin_step
for i in my_range(bin_min, count_max, 1):
    n1, _, _ = plt.hist((data1, data2), weights=(weight1, weight2),
                        stacked=False, range=(i*bin_step, 10000))
    b_counts[i] = sum(sum(n1))
    value[i] = i*bin_step  # value equals i here, but this form covers the general case
    print(b_counts[i], value[i])
I do believe that this gives me the events in the histogram in the range (value, 10000), where value is the variable.
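For the "more efficient way" part, one alternative (a sketch, not taken from the answers above): histogram each dataset once with np.histogram and take a reversed cumulative sum, so every threshold is answered from a single pass instead of one plt.hist call per step.
import numpy as np

# same data and weights as in the question
data1 = np.random.normal(0, 0.6, 1000)
data2 = np.random.normal(0, 1.4, 1000)
weight1 = np.full(len(data1), 0.5)
weight2 = np.full(len(data2), 0.9)

# histogram each dataset once over fixed edges
edges = np.linspace(-5, 5, 11)
n1, _ = np.histogram(data1, bins=edges, weights=weight1)
n2, _ = np.histogram(data2, bins=edges, weights=weight2)
total = n1 + n2

# tail[i] = weighted count in bins i..end, i.e. everything to the right of edges[i]
tail = np.cumsum(total[::-1])[::-1]

i = np.searchsorted(edges, -2)  # index of the bin edge at x = -2
print(tail[i])                  # weighted count for x > -2 (within the histogram range)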

Mean of y value in vertical bin

So I have stock market data (date from 0 onwards, and a close price), and with this I use numpy.fft to calculate the fast Fourier transform and the corresponding frequencies, which I then have in the form of a zipped list, 'FFT, Frequency'. I have the frequency values separated into vertical logarithmic bins using:
logbins = np.logspace(min(logX), max(logX), num=10, base=10.0)
I then digitize the frequency values into these bins and use:
for k in range(1, len(freqlogbins)):
    mean_freq.append(np.mean(Tfreq2[freqdig==k]))
which works fine. However, I also need to somehow work out the mean of the y values in each bin.
I imagine it is somehow possible by referring to the x values as ZippedList[0,i] and the y values as ZippedList[1,i], but I can't quite work out how.
Here is an example of copyable code:
import numpy as np
T_date0=np.arange(0,400)
T_price=np.random.uniform(100,400, size=400)
T_fft=np.fft.fft(T_price)
Tfreq=np.fft.fftfreq(T_date0.shape[-1])
I then remove any negative frequency values and the corresponding fft values using:
Tfreq2 = []
T_fft2 = []
for i in range(len(Tfreq)):
    if Tfreq[i] > 0:
        Tfreq2.append(Tfreq[i])
        T_fft2.append(T_fft[i])
T_fft_absSq=(np.absolute(T_fft2))**2
logTFAS=np.log10(T_fft_absSq)
logTfreq=np.log10(Tfreq2)
numbins=10
logbins=np.logspace((min(logTfreq)-0.00000001),(max(logTfreq)+0.00000001),num=numbins, base=10.0) #The +/-0.00000001 are so that the endpoints lie in the bin intervals.
Tfreq2=np.array(Tfreq2)
TFAS=np.array(T_fft_absSq)
freqdig=np.digitize(Tfreq2,logbins)
mean_freq = []
mean_fft = []
for k in range(1, len(logbins)):
    mean_freq.append(np.mean(Tfreq2[freqdig==k]))
Fourier = zip(logTfreq, logTFAS)
##This is where I need to work out the mean of the y values, in the vertical bins
Here is what the data look like, where the black dashed lines represent the bins and the dashed yellow lines represent the mean of the x values in each bin. The blue line is a 2nd order polynomial fit.
Obviously with random data it will look a little different to the link I posted below, but it gives an idea.
I was overthinking everything...
I was able to calculate the y value averages in a very similar way, using the same frequency binning:
for k in range(1, len(logbins)):
    mean_freq.append(np.mean(np.array(logTfreq)[freqdig==k]))
    mean_fft.append(np.mean(np.array(logTFAS)[freqdig==k]))
Not quite sure what you're asking for, but maybe np.digitize will help:
import numpy as np
d = np.random.random(1000)
bins = np.linspace(0, 1, 10)
dig = np.digitize(d, bins)
binmean = [d[dig == i].mean() for i in range(1, len(bins))]
print binmean
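The same pattern extends to a pair of arrays, which is what the question needs: digitize on x, then average the y values with the same mask. A sketch with made-up data (in the question x would be logTfreq and y logTFAS):
import numpy as np

# made-up (x, y) pairs
x = np.random.random(1000)
y = x**2 + np.random.normal(0, 0.1, 1000)

bins = np.linspace(0, 1, 10)
dig = np.digitize(x, bins)

# mean of the y values whose x landed in each bin
mean_y = [y[dig == i].mean() for i in range(1, len(bins))]
print(mean_y)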

Using pyplot to draw histogram

I have a list whose index is a degree number and whose value is the probability of that degree: x[1] = 0.01 means that degree 1 has probability 0.01.
I want to draw a distribution graph of this list, and I tried
hist = plt.figure(1)
plt.hist(PrDeg, bins = 1)
plt.title("Degree Probability Histogram")
plt.xlabel("Degree")
plt.ylabel("Prob.")
hist.savefig("Prob_Hist")
PrDeg is the list I mention above, but the saved figure is not correct: the x-axis shows the probabilities and the y-axis the degrees (the list indices).
How can I swap the x and y axis values using pyplot?
Histograms do not usually show probabilities; they show the count or frequency of observations within different intervals of values, called bins. pyplot defines the bins by splitting the range between the minimum and maximum values of your array into n equally sized bins, where n is the number you specified with the argument bins = 1. So in this case your histogram has a single bin, which gives it its odd aspect. By increasing that number you will be able to see better what actually happens there.
The only information we can get from such a histogram is that the values of your data range from 0.0 to ~0.122 and that len(PrDeg) is close to 1800. If I am right about that much, it means your graph looks like what one would expect from a histogram, and it is therefore not incorrect.
To answer your question about swapping the axes, the argument orientation=u'horizontal' is what you are looking for. I used it in the example below, renaming the axes accordingly:
import numpy as np
import matplotlib.pyplot as plt
PrDeg = np.random.normal(0,1,10000)
print PrDeg
hist = plt.figure(1)
plt.hist(PrDeg, bins = 100, orientation=u'horizontal')
plt.title("Degree Probability Histogram")
plt.xlabel("count")
plt.ylabel("Values randomly generated by numpy")
hist.savefig("Prob_Hist")
plt.show()
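As a side note, since PrDeg already holds probabilities indexed by degree, plotting the list itself as a bar chart may be closer to the intended figure than histogramming its values. A minimal sketch with a made-up list:
import matplotlib.pyplot as plt

# hypothetical probability list: index = degree, value = probability
PrDeg = [0.0, 0.01, 0.4, 0.3, 0.2, 0.09]

plt.bar(range(len(PrDeg)), PrDeg)
plt.xlabel("Degree")
plt.ylabel("Prob.")
plt.savefig("Prob_Hist")
plt.show()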

how to scale colorbar based on ylim

I am plotting a spectrogram of my data using matplotlib's specgram function.
Pxx, freqs, bins= mlab.specgram(my_data,NFFT=nFFT,Fs=Fs,detrend=mlab.detrend_linear,noverlap=n_overlap,pad_to=p_to,scale_by_freq=True)
For ref, the shape of "freqs", "bins" (i.e. times) and "Pxx" above are (1025,), (45510,) and (1025,45510) respectively.
where I have defined the function parameters as
Fs = 10E6 # Sampling Rate
w_length= 256 # window length
nFFT=2 * w_length
n_overlap=np.fix(w_length/2)
p_to = 8 *w_length
The frequency range (y-axis) for this plot is 0 to 5E6 Hz. When I plot it, I am interested in viewing different frequency ranges, for example 100E3 to 1E6 Hz. If I change the ylim of the plot, the colorbar limits don't change, i.e. they don't update to reflect the signal values in this "new" frequency range. Is there a way to make the colorbar update accordingly when I change the plotted y-axis (frequency) range?
interp = 'nearest'
cmap = seismic  # the seismic colormap and extent are defined elsewhere in my script
fig = plt.figure()
ax1 = fig.add_subplot(111)
img1 = ax1.imshow(Pxx, interpolation=interp, aspect='auto', extent=extent, cmap=cmap)
ax1.autoscale(enable=True, axis='x', tight=True)
ax1.autoscale(enable=True, axis='y', tight=True)
ax1.set_autoscaley_on(False)
ax1.set_ylim([100E3, 1E6])
fig.colorbar(img1)
plt.show()
I thought that if I could somehow find the maximum and minimum values of Pxx for the upper and lower frequencies, respectively, in the frequency range of interest, I could use those values to set the colorbar limits, e.g.
img1.set_clim(min_val, max_val)
I can find the max and min values of Pxx in general and return their indices using
import numpy as np
>>> np.unravel_index(Pxx.argmax(),Pxx.shape)
(20, 31805)
>>> np.unravel_index(Pxx.argmin(),Pxx.shape)
(1024, 31347)
How do I go about finding the values of Pxx that correspond to the freq range of interest?
I can do something like the following to roughly find where, for example, 100E3 and 1E6 are approximately located in "freqs" (and take the first (or last) value from each)...
fmin_index= [i for i,x in enumerate(freqs) if x >= 100E3][0]
fmax_index= [i for i,x in enumerate(freqs) if x >= 1000E3][0]
OR
fmin_index= [i for i,x in enumerate(freqs) if x <= 100E3][-1]
fmax_index= [i for i,x in enumerate(freqs) if x <= 1000E3][-1]
Then possibly
min_val = np.min(Pxx[fmin_index,:])
max_val = np.max(Pxx[fmax_index,:])
and finally
img1.set_clim(min_val, max_val)
Unfortunately this doesn't appear to work, in the sense that the value range on the colorbar doesn't look correct. There must be a better/easier/more accurate way to do the above.
Instead of changing the limits in the graph, a possible solution is to change the data you plot and let the colorbar do its thing. A minimal working example in the pylab environment:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import mlab

# some random data
my_data = np.random.random(2048)

#### Your Code
Fs = 10E6        # Sampling Rate
w_length = 256   # window length
nFFT = 2 * w_length
n_overlap = int(np.fix(w_length/2))  # integer number of overlapped samples
p_to = 8 * w_length
Pxx, freqs, bins = mlab.specgram(my_data, NFFT=nFFT, Fs=Fs,
                                 detrend=mlab.detrend_linear,
                                 noverlap=n_overlap,
                                 pad_to=p_to, scale_by_freq=True)

# find a maximum frequency index
maxfreq = 1E5  # replace by your maximum freq
if maxfreq:
    lastfreq = freqs.searchsorted(maxfreq)
    if lastfreq >= len(freqs):  # clamp in case maxfreq is beyond the last frequency
        lastfreq = len(freqs) - 1

Pxx = np.flipud(Pxx)  # flipping image in the y-axis

interp = 'nearest'
seismic = plt.get_cmap('seismic')
cmap = seismic
fig = plt.figure()
ax1 = fig.add_subplot(111)
extent = 0, 4, freqs[0], freqs[lastfreq]  # new extent
# plot reduced range
img1 = ax1.imshow(Pxx[-lastfreq:], interpolation=interp, aspect='auto',
                  extent=extent, cmap=cmap)
ax1.set_autoscaley_on(False)
fig.colorbar(img1)
plt.show()
My example only sets a maximum frequency, but with some small tweaks you can set a minimum as well; a sketch follows.
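For instance, a sketch of those tweaks for a band between a minimum and a maximum frequency, reusing the names from the example above (take Pxx as returned by mlab.specgram, before the flipud) and setting the colour limits from the restricted data only:
fmin, fmax = 100E3, 1E6
imin = freqs.searchsorted(fmin)
imax = min(freqs.searchsorted(fmax), len(freqs) - 1)

Pxx_band = Pxx[imin:imax+1]  # rows of Pxx inside the frequency band
extent = bins[0], bins[-1], freqs[imin], freqs[imax]
img1 = ax1.imshow(np.flipud(Pxx_band), interpolation=interp, aspect='auto',
                  extent=extent, cmap=cmap)
img1.set_clim(Pxx_band.min(), Pxx_band.max())  # colour range for this band only
fig.colorbar(img1)
plt.show()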

How to best utilize the hist() to show a cumulative and normed histogram?

I have a problem while dealing with a data set whose values range from 0 to tens of thousands. There is no problem showing a histogram of the whole data set using hist(); however, if I only want to show a detailed cumulative and normed histogram over, say, x = [0, 120], I have to use 600000 bins to preserve the detail.
The tricky part is that if I just use the range (0, 120) to show a normed and cumulative hist, it will end at 1, but that is far less than the real '1', since it is normed only within this small range of the data. Does anyone have ideas on how to use hist() in matplotlib to tackle this problem? I thought this should not be so complicated that I have to write another function to draw the hist I need.
You can set bins to a list, not an integer, e.g. bins=[1,2,3,..,120,30000,60000].
To answer your comment below, here is an excerpt from the documentation:
bins:
Either an integer number of bins or a sequence giving the bins. If bins is an integer, bins + 1 bin edges will be returned, consistent with numpy.histogram() for numpy version >= 1.3, and with the new=True argument in earlier versions. Unequally spaced bins are supported if bins is a sequence.
And here is an example with a cumulative normalized histogram. Notice the effect of bins = [100,125,150,160,170,180,190,200,210,220,230,240,250,275,300] on that bar plot: the first two bars are wider than the middle bars. A sketch of the bins-as-a-sequence idea is below.
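A minimal sketch of that idea with invented data (with a recent matplotlib this appears to work; newer versions spell the normalisation density=True instead of normed=True, and the answer below does the same normalisation by hand for older versions):
import numpy as np
import matplotlib.pyplot as plt

data = np.abs(np.random.normal(0, 10000, 100000))

# fine 1-wide bins up to 120, then a few coarse bins to cover the tail
bins = list(range(0, 121)) + [30000, 60000]
plt.hist(data, bins=bins, density=True, cumulative=True)
plt.xlim(0, 120)  # zoom in on the detailed region; the curve stays well below 1
plt.show()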
Hmmm, I guess this is related to your previous question (Memory error when dealing with huge data). My suggestion there doesn't seem to work for a cumulative histogram.
I can't get plt.hist() to play nice with cyborg's suggestion, so I did the cumsum and normalisation by hand:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import normal

inp = np.abs(normal(0, 100000, 100000))
bins = range(0, 120)
a, b = np.histogram(inp, bins=bins)
bar_edges = b[:-1]
bar_width = b[1] - b[0]
# cumulative counts, including anything below the first bin edge,
# normalised by the full sample size rather than just the plotted range
bar_height = (np.cumsum(a) + sum(inp < min(bins))) / len(inp)
plt.figure(1)
plt.bar(bar_edges, bar_height, width=bar_width)
plt.show()
