Histogram has only one bar - python

My data--a 196,585-record numpy array extracted from a pandas dataframe--are being placed into a single bin by matplotlib's hist. The data were originally integers, so I tried converting them to float as well, as shown below, but they are still not being distributed among 10 bins.
Interestingly, a small sub-sample (using df.sample(0.00x)) of the integer data are successfully distributed.
Any suggestions on where I may be erring in data preparation or use of matplotlib's histogram function would be appreciated.
x = df[(df['UNIT']=='X')].OPP_VALUE.values
num_bins = 10
n, bins, patches = plt.hist((x[(x>0)]).astype(float), num_bins, normed=False, facecolor='0.5', alpha=0.8)
plt.show()

Most likely, the number of data points with x > 0.5 is very small, but you do have some outliers that force the hist function to pick the scale it does. Try removing all values > 0.5 (or 1 if you do not want to convert to float) and then plot again.

You should modify the number of bins, for example:
number_of_bins = 200
bin_cutoffs = np.linspace(np.percentile(x,0), np.percentile(x,99),number_of_bins)
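To see why a few outliers collapse everything into one bar, here is a minimal sketch using np.histogram (which computes the same counts that plt.hist draws). The data are a made-up stand-in for x, not the asker's actual values: 10,000 small integers plus two huge outliers.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for x: small positive integers plus two extreme outliers.
x = np.concatenate([rng.integers(1, 100, size=10_000),
                    [5_000_000, 9_000_000]]).astype(float)

num_bins = 10

# Default binning spans min..max, so the outliers stretch the range
# and nearly every value lands in the first bin.
counts_raw, edges_raw = np.histogram(x, bins=num_bins)

# Bin edges cut off at the 99th percentile, as suggested above.
# Values beyond the last edge are simply left out of the counts.
bin_cutoffs = np.linspace(np.percentile(x, 0), np.percentile(x, 99), num_bins + 1)
counts_clipped, edges_clipped = np.histogram(x, bins=bin_cutoffs)

print(counts_raw)      # one dominant bin
print(counts_clipped)  # counts spread over all bins
```

Passing the same bin_cutoffs to plt.hist(x, bins=bin_cutoffs) should give the corresponding plot.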

Related

Display all the bins on sns distplot [duplicate]

To simplify my problem (it's not exactly like that but I prefer simple answers to simple questions):
I have several 2D maps that portray rectangular regions. I'd like to add axes and ticks to each map to show distances (with matplotlib, since the old code uses it), but the problem is that the regions are different sizes. I'd like nice, clear ticks on the axes, but the widths and heights of the maps can be anything...
To explain what I mean: let's say I have a map of a region whose size is 4.37 km * 6.42 km. I want ticks on the x-axis at 0, 1, 2, 3, and 4 km and on the y-axis at 0, 1, 2, 3, 4, 5, and 6 km. However, the image and the axes extend a bit beyond 4 km and 6 km, since the region is larger than 4 km * 6 km.
The space between the ticks can be constant, 1 km. However, the sizes of the maps vary quite a lot (let's say, between 5 and 15 km), and they are float values. My current script knows the size of the region and can scale the image to the right height/width ratio, but how do I tell it where to put the ticks?
There may already be a solution for this problem, but since I couldn't find suitable search words for it, I had to ask here...
Just set the tick locator to use matplotlib.ticker.MultipleLocator(x) where x is the spacing that you want (e.g. 1.0 in your example above).
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, FormatStrFormatter
x = np.arange(20)
y = x * 0.1
fig, ax = plt.subplots()
ax.plot(x, y)
ax.xaxis.set_major_locator(MultipleLocator(1.0))
ax.yaxis.set_major_locator(MultipleLocator(1.0))
# Forcing the plot to be labeled with "plain" integers instead of scientific notation
ax.xaxis.set_major_formatter(FormatStrFormatter('%i'))
plt.show()
The advantage to this is that no matter how we zoom or interact with the plot, it will always be labeled with ticks 1 unit apart.
This should give you ticks at all integer values within your current axis limits on the x axis:
import math
import matplotlib.pyplot as plt
# get values for the axis limits (unless you already have them)
xmin, xmax = plt.xlim()
# get the outermost integer values using ceiling and floor
# (converted to int to avoid a DeprecationWarning),
# then get all the integer values between them using range
new_xticks = list(range(int(math.ceil(xmin)), int(math.floor(xmax)) + 1))
# passing the same argument twice here because the first gives the tick locations
# and the second gives the tick labels, which should just be the numbers
plt.xticks(new_xticks, new_xticks)
Repeat for the y axis.
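The floor/ceiling logic can be wrapped in a small helper so it is reused for both axes. This is only a sketch; the function name integer_ticks is made up for illustration, and the limits are the example region size from the question.

```python
import math

def integer_ticks(lower, upper):
    """Return every integer position that lies inside [lower, upper]."""
    lo, hi = sorted((lower, upper))  # tolerate inverted axis limits
    return list(range(math.ceil(lo), math.floor(hi) + 1))

# Example region from the question: 4.37 km wide, 6.42 km high.
x_ticks = integer_ticks(0.0, 4.37)
y_ticks = integer_ticks(0.0, 6.42)

print(x_ticks)
print(y_ticks)
```

The results can then be passed to plt.xticks(x_ticks, x_ticks) and plt.yticks(y_ticks, y_ticks).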
Out of curiosity: what kind of ticks do you get by default?
Okay, I tried your versions, but unfortunately I couldn't make them work, since there was some scaling and PDF locating stuff that left me (and your code suggestions) badly confused. But by testing them, I learned a lot more Python, thanks!
I finally managed to find a solution that isn't very exact but satisfies my needs. Here is how I did it.
In my version, one km is divided by a suitable integer constant named STEP_PART. The bigger STEP_PART is, the more accurate the axis values are (and if it is too big, the axis becomes messy to read). For example, if STEP_PART is 5, the accuracy is 1 km / 5 = 200 m, and ticks are placed every 200 m.
STEP_PART = 5   # Set at the start of the program.
height = 6.42   # These are actually given elsewhere,
width = 4.37    # but just as an example...
# Make tick vectors, for now in the format 0, 1, 2... instead of 0, 0.2...
# They are divided by STEP_PART later to get the right values.
# list() is needed on Python 3, where a range cannot be modified in place.
vHeight = list(range(0, int(STEP_PART*height), 1))
vWidth = list(range(0, int(STEP_PART*width), 1))
To avoid making too many axis labels (0, 1, 2... are enough; 0, 0.2, 0.4... is far too much), we replace non-integer km values with the string "". At the same time, we divide integer km values by STEP_PART to get the right values.
for j in range(len(vHeight)):
    if j % STEP_PART != 0:
        vHeight[j] = ""
    else:
        vHeight[j] = vHeight[j] // STEP_PART
for i in range(len(vWidth)):
    if i % STEP_PART != 0:
        vWidth[i] = ""
    else:
        vWidth[i] = vWidth[i] // STEP_PART
Later, after creating the graph and axes, the ticks are placed like this (x axis as an example). Here, x is the actual width of the picture, obtained with the shape() command (I don't exactly understand how... there is quite a lot of scaling and other stuff in the code I'm modifying).
xt = np.linspace(0,x-1,len(vWidth)+1) # For locating those ticks on the same distances.
locs, labels = mpl.xticks(xt, vWidth, fontsize=9)
Repeat for the y axis. The result is a graph with ticks every 200 m but labels only at the integer km values. The accuracy of those axes is 200 m; it's not exact, but it was enough for me. The script would be even better if I found out how to make the integer ticks bigger...
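One possible way to get bigger integer ticks is to let matplotlib handle major and minor ticks separately. The sketch below is not the asker's script: the random image is a hypothetical stand-in for the map, and it assumes the image can be drawn with imshow and an extent given in km. Major ticks sit at whole km (labelled), minor ticks every 1/STEP_PART km (unlabelled), and tick_params sets different lengths for each.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import MultipleLocator

width_km, height_km = 4.37, 6.42  # example region size from the question
STEP_PART = 5                     # minor ticks per km, as in the answer above

# Hypothetical stand-in for the map image.
image = np.random.default_rng(0).random((64, 44))

fig, ax = plt.subplots()
# extent maps pixel coordinates to kilometres, so ticks are placed in km.
ax.imshow(image, extent=(0, width_km, 0, height_km), origin="lower")

for axis in (ax.xaxis, ax.yaxis):
    axis.set_major_locator(MultipleLocator(1.0))              # labelled, every 1 km
    axis.set_minor_locator(MultipleLocator(1.0 / STEP_PART))  # unlabelled, every 200 m

# Longer ticks at the integer km positions, shorter ones in between.
ax.tick_params(which="major", length=8)
ax.tick_params(which="minor", length=3)

# Major tick positions that fall inside the visible axis range.
x_major = [t for t in ax.get_xticks() if 0 <= t <= width_km]
y_major = [t for t in ax.get_yticks() if 0 <= t <= height_km]
```

Minor ticks carry no labels by default, so there is no need to blank out the in-between labels by hand.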

How to create uint16 gaussian noise image?

I want to create a uint16 image of gaussian noise with a defined mean and standard deviation.
I've tried using numpy's random.normal for this but it returns a float64 array:
mu = 10
sigma = 100
shape = (1024,1024)
gauss_img = np.random.normal(mu, sigma, shape)
print(gauss_img.dtype)
>>> dtype('float64')
Is there a way to convert gauss_img to a uint16 array while preserving the original mean and standard deviation? Or is there another approach entirely to creating a uint16 noise image?
EDIT: As was mentioned in the comments, np.random.normal will inevitably sample negative values given a sd > mean, which is a problem for converting to uint16.
So I think I need a different method that will create an unsigned gaussian image directly.
So I think this is close to what you're looking for.
Import libraries and spoof some skewed data. Here, since the input is of unknown origin, I created skewed data using np.expm1(np.random.normal()). You could use skewnorm().rvs() as well, but that's kind of cheating since that's also the lib you'll use to characterize it.
I flatten the raw samples to make plotting histograms easier.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skewnorm
# generate dummy raw starting data
# smaller shape just for simplicity
shape = (100, 100)
raw_skewed = np.maximum(0.0, np.expm1(np.random.normal(2, 0.75, shape))).astype('uint16')
# flatten to look at histograms and compare distributions
raw_skewed = raw_skewed.reshape((-1))
Now find the params that characterize your skewed data, and use those to create a new distribution to sample from that hopefully matches your original data well.
These two lines of code are really what you're after I think.
# find params
a, loc, scale = skewnorm.fit(raw_skewed)
# mimic orig distribution with skewnorm
new_samples = skewnorm(a, loc, scale).rvs(10000).astype('uint16')
Now plot the distributions of each to compare.
plt.hist(raw_skewed, bins=np.linspace(0, 60, 30), hatch='\\', label='raw skewed')
plt.hist(new_samples, bins=np.linspace(0, 60, 30), alpha=0.65, color='green', label='mimic skewed dist')
plt.legend()
The histograms are pretty close. If that looks good enough, reshape your new data to the desired shape.
# final result
new_samples.reshape(shape)
Now... here's where I think it probably falls short. Take a look at the heatmap of each. The original distribution had a longer tail to the right (more outliers that skewnorm() didn't characterize).
This plots a heatmap of each.
# plot heatmaps of each
fig = plt.figure(2, figsize=(18,9))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
im1 = ax1.imshow(raw_skewed.reshape(shape), vmin=0, vmax=120)
ax1.set_title("raw data - mean: {:3.2f}, std dev: {:3.2f}".format(np.mean(raw_skewed), np.std(raw_skewed)), fontsize=20)
im2 = ax2.imshow(new_samples.reshape(shape), vmin=0, vmax=120)
ax2.set_title("mimicked data - mean: {:3.2f}, std dev: {:3.2f}".format(np.mean(new_samples), np.std(new_samples)), fontsize=20)
plt.tight_layout()
# add colorbar
fig.subplots_adjust(right=0.85)
cbar_ax = fig.add_axes([0.88, 0.1, 0.08, 0.8]) # [left, bottom, width, height]
fig.colorbar(im1, cax=cbar_ax)
Looking at it... you can see occasional flecks of yellow indicating very high values in the original distribution that didn't make it into the output. This also shows up in the higher std dev of the input data (see the titles of each heatmap; but again, as noted in the comments on the original question, mean and std don't really characterize the distributions since they're not normal... they're included as a relative comparison).
But... that's just the problem it has with the very specific skewed sample I created to get started. There's hopefully enough here to mess around with and tune until it suits your needs and your specific dataset. Good luck!
With that mean and sigma you are bound to sample some negative values. So I guess one option is to find the most negative value after sampling and add its absolute value to all the samples. After that, convert to uint as suggested in the comments. But of course you lose the mean this way.
If you have a range of uint16 numbers to sample from, then you should check out this post.
This way you could use scipy.stats.truncnorm to generate a gaussian of unsigned integers.
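A minimal sketch of the truncated-Gaussian idea using only NumPy: instead of scipy.stats.truncnorm, it redraws any sample that falls outside the uint16 range (rejection sampling), which also yields a truncated normal. The function name gaussian_noise_uint16 and its seed argument are made up for illustration. Note that truncation changes the statistics: with mu = 10 and sigma = 100, the mean of the result ends up well above 10.

```python
import numpy as np

def gaussian_noise_uint16(mu, sigma, shape, seed=None):
    """Sample a normal distribution truncated to the uint16 range, rounded to integers."""
    rng = np.random.default_rng(seed)
    hi = np.iinfo(np.uint16).max  # 65535
    img = rng.normal(mu, sigma, shape)
    # Values in [-0.5, hi + 0.5) round to 0..hi; anything else is redrawn.
    bad = (img < -0.5) | (img >= hi + 0.5)
    while bad.any():
        img[bad] = rng.normal(mu, sigma, int(bad.sum()))
        bad = (img < -0.5) | (img >= hi + 0.5)
    return np.rint(img).astype(np.uint16)

img = gaussian_noise_uint16(mu=10, sigma=100, shape=(256, 256), seed=0)
print(img.dtype, img.min(), img.max())
print(img.mean(), img.std())  # differ from mu and sigma because of the truncation
```

If the original mean and standard deviation must be preserved exactly, mu needs to sit several sigma above zero so that almost nothing is truncated.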

How to make a CDF in Python?

I made the PDF using the hist code below:
plt.figure()
values1,bins1,_ = plt.hist(np.log10(fakeclusterlum),bins=20)
plt.hist(np.log10(bigclusterlum151mh),alpha = .5,bins = bins1)
but I am not sure how to turn this into a CDF. I want to plot both the fakeclusterlum and bigclusterlum151mh points. I apologise if that doesn't make sense; I am somewhat of a beginner!
pyplot.hist has an argument
cumulative : bool, optional
If True, then a histogram is computed where each bin gives the counts in that bin plus all bins for smaller values. The last bin gives the total number of datapoints.
Default: False
Hence use
plt.hist(..., cumulative=True)
to plot a cumulative histogram.
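For reference, here is a sketch of what a normalised cumulative histogram computes, done by hand with NumPy. The two lognormal samples are made-up stand-ins for fakeclusterlum and bigclusterlum151mh, and the bin edges here span both samples (unlike the question, which reuses the bins of the first sample).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for fakeclusterlum and bigclusterlum151mh.
fake = rng.lognormal(mean=3.0, sigma=1.0, size=500)
big = rng.lognormal(mean=3.5, sigma=0.8, size=300)

log_fake, log_big = np.log10(fake), np.log10(big)

# Shared bin edges covering both samples, 20 bins as in the question.
edges = np.histogram_bin_edges(np.concatenate([log_fake, log_big]), bins=20)

def cumulative_fraction(values, edges):
    """Fraction of values at or below the upper edge of each bin."""
    counts, _ = np.histogram(values, bins=edges)
    return np.cumsum(counts) / counts.sum()

cdf_fake = cumulative_fraction(log_fake, edges)
cdf_big = cumulative_fraction(log_big, edges)

print(cdf_fake[-1], cdf_big[-1])  # both curves end at 1
```

As far as I know, plt.hist(log_fake, bins=edges, cumulative=True, density=True, histtype='step') draws the same curve directly.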

Using pyplot to draw histogram

I have a list.
The index of the list is the degree number.
The value is the probability of that degree number.
For example, x[1] = 0.01 means that the probability of degree 1 is 0.01.
I want to draw a distribution graph of this list, and I tried:
hist = plt.figure(1)
plt.hist(PrDeg, bins = 1)
plt.title("Degree Probability Histogram")
plt.xlabel("Degree")
plt.ylabel("Prob.")
hist.savefig("Prob_Hist")
PrDeg is the list which I mentioned above.
But the saved figure is not correct.
The x axis shows the probability and the y axis shows the degree (the index of the list).
How can I exchange the x and y axis values using pyplot?
Histograms do not usually show you probabilities; they show the count or frequency of observations within different intervals of values, called bins. pyplot defines the bins by splitting the range between the minimum and maximum value of your array into n equally sized intervals, where n is the number you specified with the argument bins = 1. So, in this case your histogram has a single bin, which gives it its odd appearance. By increasing that number you will be able to better see what actually happens there.
The only information that we can get from such a histogram is that the values of your data range from 0.0 to ~0.122 and that len(PrDeg) is close to 1800. If I am right about that much, it means your graph looks like what one would expect from a histogram and it is therefore not incorrect.
To answer your question about swapping the axes, the argument orientation=u'horizontal' is what you are looking for. I used it in the example below, renaming the axes accordingly:
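The binning described above can be checked with np.histogram, which computes the same counts. This is only an illustration: pr_deg is a made-up list of 1800 probabilities, not the asker's PrDeg.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-degree probabilities: index = degree, value = probability.
pr_deg = rng.random(1800)
pr_deg /= pr_deg.sum()

# bins=1: a single interval from min to max that holds every value.
counts_1, edges_1 = np.histogram(pr_deg, bins=1)

# bins=10: the same min..max range split into 10 equal intervals.
counts_10, edges_10 = np.histogram(pr_deg, bins=10)

print(counts_1, edges_1)
print(counts_10)
```

If the goal is to show probability against degree, a bar chart such as plt.bar(range(len(pr_deg)), pr_deg) may be closer to what is wanted than a histogram; that is a suggestion on my part, not part of the answer above.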
import numpy as np
import matplotlib.pyplot as plt
PrDeg = np.random.normal(0,1,10000)
print(PrDeg)
hist = plt.figure(1)
plt.hist(PrDeg, bins = 100, orientation=u'horizontal')
plt.title("Degree Probability Histogram")
plt.xlabel("count")
plt.ylabel("Values randomly generated by numpy")
hist.savefig("Prob_Hist")
plt.show()

How to best utilize the hist() to show a cumulative and normed histogram?

I have a problem with a data set whose values range from 0 to tens of thousands. There is no problem showing the histogram of the whole data set using hist(). However, if I only want to show a detailed cumulative, normed histogram over, say, x = [0, 120], I have to use 600000 bins to get enough detail.
The tricky part is that if I just use the range (0, 120) to show the normed and cumulative hist, it ends at 1. But the true value is far less than 1, since the histogram has only been normed within this small range of the data. Does anyone have ideas on how to use hist() in matplotlib to tackle this problem? I thought this should not be so complicated that I have to write another function to draw the hist I need.
You can set bins to a list, not an integer, e.g., bins=[1,2,3,..,120,30000,60000].
To answer your comment below, here is an excerpt from the documentation:
bins:
Either an integer number of bins or a sequence giving the bins. If bins is an integer, bins + 1 bin edges will be returned, consistent with numpy.histogram() for numpy version >= 1.3, and with the new = True argument in earlier versions. Unequally spaced bins are supported if bins is a sequence.
And here is an example with cumulative normalized histogram. Notice the effect of bins = [100,125,150,160,170,180,190,200,210,220,230,240,250,275,300] on this bar plot, how the first two bars are wider than the middle bars.
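A sketch of the unequal-bins idea with made-up data (the half-normal sample below is a stand-in, not the asker's data set): fine bins cover 0 to 120, and one wide bin runs up to the maximum so that the normalisation uses the whole data set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data ranging from 0 to tens of thousands.
data = np.abs(rng.normal(0, 10_000, size=100_000))

# Edges 0, 1, ..., 120, then one wide bin up to the maximum value.
edges = np.append(np.arange(0, 121), data.max())
counts, _ = np.histogram(data, bins=edges)

# Cumulative fraction relative to ALL data, not just the 0..120 range.
cum_fraction = np.cumsum(counts) / data.size

# The first 120 bins cover [0, 120); the last entry covers everything.
fraction_up_to_120 = cum_fraction[119]
print(fraction_up_to_120)  # far below 1
print(cum_fraction[-1])    # the full data set
```

To plot only the detailed part, something like plt.bar(edges[:120], cum_fraction[:120], width=1, align='edge') should work.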
Hmmm, I guess this is related to your previous question (Memory error when dealing with huge data). My suggestion there doesn't seem to work for a cumulative histogram.
I can't get plt.hist() to play nice with cyborg's suggestion, so I did the cumsum and normalisation by hand:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import normal
inp = np.abs(normal(0, 100000, 100000))
bins = range(0, 120)
a,b = np.histogram(inp, bins = bins)
bar_edges = b[:-1]
bar_width = b[1] - b[0]
bar_height = (np.cumsum(a) + sum(inp<min(bins))) / len(inp)
plt.figure(1)
plt.bar(bar_edges, bar_height, width = bar_width)
plt.show()
