Python 2d-histogram success rate per bin

I have data in arrays x, y and w, where x and y indicate position and w is a weight of either 1 or 0 indicating success or failure. I'm trying to create a 2d histogram where each bin is coloured by the percentage of successes in that bin (i.e. the number of successes in the bin divided by the total number of points in the bin). I've played around with numpy.histogram2d quite a bit and can get density plots going, but this is not the same as the percentage-of-success scheme I'm aiming for. Please note that passing normed=True to numpy.histogram2d does not solve this problem.
(To clarify on the difference, a density plot would indicate large 'colour value' if there is a larger number of successes in the bin regardless of how many failures are in the same bin. I would like to have the percentage of successes instead, so a large number of failures in the same bin would give a smaller 'colour value'. I apologise for poor terminology).
Thank you very much to anyone who can help!
Example of current code that doesn't do what I'm aiming for:
import matplotlib.pyplot as plt
import numpy as np
plt.figure(1)
H, xedges, yedges = np.histogram2d(x, y, bins=50, weights=w, normed=True)
extent = [yedges[0], yedges[-1], xedges[-1], xedges[0]]
plt.imshow(H, extent=extent, interpolation='nearest')
plt.colorbar()
plt.show()

I'm pretty sure this works, but you don't give data, so it's hard to check. normed=True gives you densities; if you don't pass normed=True, you get weighted sample counts. So if you divide the weighted version (which is really just the number of successes per bin) by the unweighted version (the total number of points in each bin), you end up with the percentage of successes.
import matplotlib.pyplot as plt
import numpy as np
plt.figure(1)
H, xedges, yedges = np.histogram2d(x, y, bins=50, weights=w)  # number of successes per bin
H2, _, _ = np.histogram2d(x, y, bins=50)                      # total number of points per bin
extent = [0, 1, xedges[-1], xedges[0]]
plt.imshow(H/H2, extent=extent, interpolation='nearest')      # per-bin success rate
plt.colorbar()
plt.show()
This could leave nan in the final histogram, so you might want to make a decision for those cases.
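For the empty bins, one option (a minimal sketch, assuming the H, H2 and extent variables from the snippet above) is to divide only where the denominator is non-zero, leaving empty bins as NaN so they show up as blanks rather than raising divide-by-zero warnings:
rate = np.divide(H, H2, out=np.full_like(H, np.nan), where=H2 > 0)
plt.imshow(rate, extent=extent, interpolation='nearest')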

Related

Adding histogram bins together and plotting a figure

I have a histogram with 8192 bins, each bin imported from a line of a text file. To cut things short, it makes an awful fit, and it was suggested to me that I could reduce the statistical errors by adding counts from adjacent bins, e.g. add bins 0-7 to make a new first bin, 8 times as wide but (roughly) 8 times as high.
Ideally, I would like to be able to output a histogram with a bin width controlled by a single constant in the code. However, my attempts to do this, instead of producing something like the first image below (which came from the version of my code that can only do a bin width of 1), produce something like the second image below: the fit lines are missing and there is a second, empty graph in the same image file (the result of my attempts to generalise the code for any bin width).
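For reference, one way to merge adjacent bins under a single width constant is a reshape-and-sum; this is a minimal sketch rather than the poster's code, and the file name and width below are placeholders:
import numpy as np

counts = np.loadtxt("spectrum.Spe", skiprows=12, max_rows=8192)  # placeholder file name
width = 8                                      # number of original bins per new bin
n_new = counts.size // width                   # 8192 // 8 = 1024 merged bins
rebinned = counts[:n_new * width].reshape(n_new, width).sum(axis=1)
bin_centres = np.arange(n_new) * width + (width - 1) / 2.0       # centre of each merged bin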
The following is a histogram plotted directly from the original data, i.e. bin width = 1.
Original code output; it only works for bin width 1, though.
Example output when trying bin width 8 with some code modifications.
I also need it to return a fit report and the area under the Gaussian, as this is plotted later on in the code as part of an exponential decay curve.
Here is the section of code I think is relevant:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from numpy import exp, loadtxt, pi, sqrt, random, linspace
from lmfit import Model
import glob, os
## Load text file
x = np.linspace(0, 8191, 8192)
finalprefix = str(n).zfill(3)
fullprefix = folderToAnalyze + prefix + finalprefix
y = loadtxt(fullprefix + ".Spe", skiprows= 12, max_rows = 8192)
## Make figure and label
fig, ax = plt.subplots(figsize=(15,8))
fig.suptitle('Photon coincidence detections from $β^+$ + $β^-$ annihilation', fontsize=18)
plt.xlabel('Bins', fontsize=14)
plt.ylabel('Counts', fontsize=14)
## Plot data
ax.bar(x, y)
ax.set_xlim(600,960)
## Adding Bins Together
y = y.astype(int)
x = x.astype(int)
## create the data
data = np.repeat(x, y)
## determine the range of x
x_range = range(min(data), max(data)+1)
## determine the length of x
x_len = len(x_range)
## plot
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(10, 10))
ax1.hist(data, bins=x_len) # outliers are not plotted
plt.show()
## given x_len determine how many bins for a given bin width
width = 8
bins = int(np.round(x_len / width))
## determine new x and y for the histogram
y, x = np.histogram(data, bins=bins)
## Fit data to Gaussian
gmodel = Model(gaussian)
result = gmodel.fit(y, x=x[:-1], amp=8, cen=approxcen, wid=1)
## result
print(result.fit_report())
fig.savefig("abw_" + finalprefix + ".png")
## Append to list if error in amplitude and amplitude itself is within reasonable bounds
if result.params['amp'].stderr < stderrThreshold and result.params['amp'] > minimumAmplitude:
    amps.append(result.params['amp'].value)
    ampserr.append(result.params['amp'].stderr)
    ts.append(MaestroT*n)
## Plot decay curve
fig, ax = plt.subplots()
ax.errorbar(ts, amps, yerr= 2*np.array(ampserr), fmt="ko-", capsize = 5, capthick= 2, elinewidth=3, markersize=5)
plt.xlabel('Time', fontsize=14)
plt.ylabel('Peak amplitude', fontsize=14)
plt.title("Decay curve of P-31 by $β^+$ emission", fontsize=14)
Some synthetic data: {1,2,1,0,0,0,0,0,6,0,0,0,0,0,0,0,7,0,0,1,0,1,0,0,6,6,0,0,0,3,0,0,3,3,3,5,4,0,4,3,1,4,0,5,6,4,0,2,0,0,0,9,6,1,1,1,0,0,3,2,2,3,0,0,0,2,4,0,0,0,0,0,0,4,10,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0}
I think this should create two very differently shaped histograms when the bin width is 1 and when it is 8. I have just made the numbers up, so the fit may not be good, and it is worth mentioning that one of the problems I was having relates to adding together the information read in from the text file.
In case it's useful:
-Here is the full original code
-Here is the data for that histogram

2d histogram: Get result of full nbins x nbins

I am using matplotlib's hist2d function to make a 2d histogram of data that I have, however I am having trouble interpreting the result.
Here is the plot I have:
This was created using the line:
hist = plt.hist2d(X, Y, (160,160), norm=mpl.colors.LogNorm(vmin=1, vmax=20))
This returns a 2d array of (160, 160), as well as the bin edges etc.
In the plot there are bins which have a high frequency of values (yellow bins). I would like to get the results of this histogram and filter out the bins that have low values, preserving the high bins. I would expect there to be 160*160 values, but I can only find 160 X and 160 Y values.
What I would like to do is essentially filter out the more dense data from the less dense data. If this means representing the data as a single value (a bin), then that is ok.
Am I misinterpreting the function, or am I not accessing the data results correctly? I have tried with scipy as well, but the results seem to be in the same or a similar format.
Not sure if this is what you wanted.
The hist2d docs specify that the function returns a tuple of size 4, where the first item h is the 2D histogram (a heatmap).
This h has one value per bin, so its shape matches the bins argument: (160, 160) here.
You can capture the output (it will still plot), and use argwhere to find coordinates where values exceed, say, the 90th percentile:
h, xedges, yedges, img = plt.hist2d(X, Y, bins=(160, 160), norm=mpl.colors.LogNorm(vmin=1, vmax=20))
print(list(np.argwhere(h > np.percentile(h, 90))))
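To actually filter rather than just locate the dense bins, a small follow-up (the 90th-percentile threshold here is an assumption, not from the original answer) could zero out everything below the threshold:
h_dense = np.where(h > np.percentile(h, 90), h, 0)  # keep only the high-count bins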
You need the Seaborn package.
You mentioned
I would like to be able to get the results of this histogram and filter out the bins that have low values, preserving the high bins.
You should definitely be using one of these:
seaborn.jointplot(..., kind='hex'): shows the counts of observations that fall within hexagonal bins. This plot works best with relatively large datasets.
seaborn.jointplot(..., kind='kde'): uses kernel density estimation to visualize a bivariate distribution. I recommend this one.
Example with kind='kde':
Use the number of levels n_levels and shade_lowest=False to ignore low values.
import seaborn as sns
import numpy as np
import matplotlib.pylab as plt
x, y = np.random.randn(2, 300)
plt.figure(figsize=(6,5))
sns.kdeplot(x, y, zorder=0, n_levels=6, shade=True, cbar=True,
            shade_lowest=False, cmap='viridis')

Make a 2d histogram show if a certain value is above or below average?

I made a 2d histogram of two variables (x and y), each of which is a long 1d array. I then calculated the average of x in each bin and want to make the colorbar show how much each x is above or below the average in its respective bin.
So far I have tried to make a new array, z, that contains the values for how far above/below average each x is. When I try to use this with pcolormesh I run into issues because it is not a 2-D array. I also tried to solve this by following the solution from this problem (Using pcolormesh with 3 one dimensional arrays in python). The arrays x, y and z are all of equal length, and there is a corresponding z value for each x value.
My overall goal is to just have the colorbar not dependent on counts but to have it show how much above/below average each x value is from the average x of the bin. I suspect that it may make more sense to just plot x vs. z but I do not think that would fix my colorbar issue.
As LoneWanderer mentioned, some sample code would be useful; however, let me make an attempt at what you want.
import numpy as np
import matplotlib.pyplot as plt
N = 10000
x = np.random.uniform(0, 1, N)
y = np.random.uniform(0, 1, N) # Generating x and y data (you will already have this)
# Histogram data
xbins = np.linspace(0, 1, 100)
ybins = np.linspace(0, 1, 100)
hdata, xedges, yedges = np.histogram2d(x, y, bins=(xbins, ybins))
# compute the histogram average value and the difference
hdataMean = np.mean(hdata)
hdataRelDifference = (hdata - hdataMean) / hdataMean
# Plot the relative difference
fig, ax = plt.subplots(1, 1)
cax = ax.imshow(hdataRelDifference)
fig.colorbar(cax, ax=ax)
If this is not what you intended, hopefully there are enough pieces here to adapt it for your needs.
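The snippet above compares per-bin counts to their mean. If the goal is instead the per-bin average of a per-point value z (such as the deviation array the question mentions), a sketch along these lines using scipy.stats.binned_statistic_2d may be closer; x, y and z below are synthetic stand-ins for the real data:
from scipy.stats import binned_statistic_2d
import numpy as np
import matplotlib.pyplot as plt

N = 10000
x = np.random.uniform(0, 1, N)
y = np.random.uniform(0, 1, N)
z = np.random.normal(0, 1, N)                       # the per-point value of interest

# Mean of z in each (x, y) bin; empty bins come back as NaN
stat, xedges, yedges, _ = binned_statistic_2d(x, y, z, statistic='mean', bins=50)
deviation = stat - np.nanmean(stat)                 # above/below the overall bin average

fig, ax = plt.subplots()
im = ax.imshow(deviation.T, origin='lower',
               extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
fig.colorbar(im, ax=ax)
plt.show()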

Histogram with curve, representing histogram trend

I have plotted a histogram and now I want a curve that represents the histogram's trend. I want my histogram binning to be logarithmic (as I have below in the code; Mass is a predefined variable ranging from 10^43 to 10^45 grams).
I have looked at many, many codes but could not adapt any of them to my case (I tried to modify them as well). Do you know how I can make this curve? Ideally, I just want to modify my code so that it also plots this curve on top of the histogram.
Thanks,
Salome
See the attached image
import matplotlib.pyplot as plt
import numpy as np
x=Mass
hist, bins = np.histogram(x, bins=10)
logbins = np.logspace(np.log10(bins[0]),np.log10(bins[-1]),len(bins))
n, bins, patches = plt.hist(x=Mass, bins=logbins, color='#0504aa', alpha=0.8, rwidth=0.85)
plt.xscale('log')
plt.xlabel('Mass $(g)$ ')
plt.ylabel('Number of halos')
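# One possible trend curve (a sketch, not from the original post): plot the
# counts n at the geometric centre of each log-spaced bin, before plt.show()
# so the line is drawn on the same axes as the bars.
centers = np.sqrt(logbins[:-1] * logbins[1:])
plt.plot(centers, n, color='crimson', marker='o')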
plt.show()

How can I plot ca. 20 million points as a scatterplot?

I am trying to create a scatterplot with matplotlib that consists of ca. 20 million data points. Even after setting the alpha value to the lowest it can go before no data is visible at all, the result is just a completely black plot.
plt.scatter(timedPlotData, plotData, alpha=0.01, marker='.')
The x-axis is a continuous timeline of about 2 months and the y-axis consists of 150k consecutive integer values.
Is there any way to plot all the points so that their distribution over time is still visible?
Thank you for your help.
There's more than one way to do this. A lot of folks have suggested a heatmap/kernel-density-estimate/2d-histogram. @Bucky suggested using a moving average. In addition, you can fill between a moving min and moving max, and plot the moving mean over the top. I often call this a "chunkplot", but that's a terrible name. The implementation below assumes that your time (x) values are monotonically increasing. If they're not, it's simple enough to sort y by x before "chunking" in the chunkplot function.
Here are a couple of different ideas. Which is best will depend on what you want to emphasize in the plot. Note that this will be rather slow to run, but that's mostly due to the scatterplot. The other plotting styles are much faster.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
np.random.seed(1977)

def main():
    x, y = generate_data()
    fig, axes = plt.subplots(nrows=3, sharex=True)
    for ax in axes.flat:
        ax.xaxis_date()
    fig.autofmt_xdate()
    axes[0].set_title('Scatterplot of all data')
    axes[0].scatter(x, y, marker='.')
    axes[1].set_title('"Chunk" plot of data')
    chunkplot(x, y, chunksize=1000, ax=axes[1],
              edgecolor='none', alpha=0.5, color='gray')
    axes[2].set_title('Hexbin plot of data')
    axes[2].hexbin(x, y)
    plt.show()

def generate_data():
    # Generate a very noisy but interesting timeseries
    x = mdates.drange(dt.datetime(2010, 1, 1), dt.datetime(2013, 9, 1),
                      dt.timedelta(minutes=10))
    num = x.size
    y = np.random.random(num) - 0.5
    y.cumsum(out=y)
    y += 0.5 * y.max() * np.random.random(num)
    return x, y

def chunkplot(x, y, chunksize, ax=None, line_kwargs=None, **kwargs):
    if ax is None:
        ax = plt.gca()
    if line_kwargs is None:
        line_kwargs = {}
    # Wrap the array into a 2D array of chunks, truncating the last chunk if
    # chunksize isn't an even divisor of the total size.
    # (This part won't use _any_ additional memory)
    numchunks = y.size // chunksize
    ychunks = y[:chunksize*numchunks].reshape((-1, chunksize))
    xchunks = x[:chunksize*numchunks].reshape((-1, chunksize))
    # Calculate the max, min, and means of chunksize-element chunks...
    max_env = ychunks.max(axis=1)
    min_env = ychunks.min(axis=1)
    ycenters = ychunks.mean(axis=1)
    xcenters = xchunks.mean(axis=1)
    # Now plot the bounds and the mean...
    fill = ax.fill_between(xcenters, min_env, max_env, **kwargs)
    line = ax.plot(xcenters, ycenters, **line_kwargs)[0]
    return fill, line

main()
For each day, tally up the frequency of each value (a collections.Counter will do this nicely), then plot a heatmap of the values, one per day. For publication, use a grayscale for the heatmap colors.
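As a rough sketch of that per-day tally (the day indices and value ranges here are synthetic assumptions, not the poster's data):
import collections
import numpy as np
import matplotlib.pyplot as plt

days = np.random.randint(0, 60, 1000000)      # integer day index of each point
values = np.random.randint(0, 150, 1000000)   # integer y-value of each point

counts = collections.Counter(zip(days, values))
heat = np.zeros((values.max() + 1, days.max() + 1))
for (d, v), c in counts.items():
    heat[v, d] = c                            # one column per day, one row per value

plt.imshow(heat, aspect='auto', origin='lower', cmap='gray_r')
plt.colorbar(label='count')
plt.show()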
My recommendation would be to use a sorting and moving average algorithm on the raw data before you plot it. This should leave the mean and trend intact over the time period of interest while providing you with a reduction in clutter on the plot.
Group values into bands on each day and use a 3d histogram of count, value band, day.
That way you can get the number of occurrences in a given band on each day clearly.
