A mathematical explanation for why variance of bootstrap estimates decreases - python

I am trying to grok bootstrapping and bagging (bootstrap aggregation), so I've been attempting to perform some experiments. I loaded in a sample dataset from Kaggle and attempted to use the bootstrapping method:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

X = pd.read_csv("dataset.csv")
true_median = np.median(X["Impressions"])

B = 500
errors = []
variances = []
for b in range(1, B):
    sample_medians = [np.median(X.sample(len(X), replace=True)["Impressions"]) for i in range(b)]
    error = np.mean(sample_medians) - true_median
    variances.append(np.std(sample_medians) ** 2)
    errors.append(error)
Then I visualized errors and variances:
fig, ax1 = plt.subplots()
color = 'tab:red'
ax1.set_xlabel('Number of Bootstrap Samples (B)')
ax1.set_ylabel('Bootstrap Estimate Error', color=color)
ax1.plot(errors, color=color, alpha=0.7)
ax1.tick_params(axis='y', labelcolor=color)
ax2 = ax1.twinx()
color = 'tab:blue'
ax2.set_ylabel('Bootstrap Estimate Variance', color=color)
ax2.plot(variances, color=color, alpha=0.7)
ax2.tick_params(axis='y', labelcolor=color)
fig.tight_layout()
plt.title("Relationship Between Bootstrap Error, Variance \nand Number of Bootstrap Iterations")
plt.show()
This is the output of the plot:
You can see that both the error and the variance decrease as B increases.
I'm trying to find some sort of mathematical justification - is there a way to derive or prove why the variance of bootstrap estimates decreases when B increases?

I think what you are seeing is the Central Limit Theorem at play. When the loop starts, the number of samples drawn from the population with replacement is small, so the mean of the medians (what you call the error, after subtracting the true median) is not yet representative of the true population median. As you generate more samples, the mean of the medians converges to the true median asymptotically. As that convergence towards the true median happens, the draws no longer scatter widely enough to produce a large variance, so the variance settles down as well.
Did that clarify it? If not, please elaborate on what you expected to see in the plots and we can discuss how to get there.
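To make that CLT point concrete, here is a minimal sketch (not your data: it replaces the bootstrap medians with synthetic normal draws of an arbitrary spread sigma) showing that the run-to-run spread of a mean of B draws shrinks like sigma/sqrt(B):
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0  # spread of the stand-in "bootstrap median" distribution

for B in (10, 100, 1000):
    # Repeat the "mean of B draws" experiment many times and measure how much
    # that mean itself fluctuates from run to run.
    means = [np.mean(rng.normal(loc=100.0, scale=sigma, size=B)) for _ in range(2000)]
    print(f"B={B:5d}  sd of mean={np.std(means):.4f}  sigma/sqrt(B)={sigma/np.sqrt(B):.4f}")
The same scaling is what flattens your error curve as B grows.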

Related

PyPlot Bar chart shows non-existent values?

I need some help with a pyplot bar chart that isn't doing what it should, and I cannot figure out why.
So basically what I need to do is draw the power function of a binomial distribution test. First I plot the binomial distribution and mark important values.
from scipy.stats import binom
import numpy as np
import matplotlib.pyplot as plt
n = 20
p = 1/2
x_values = list(range(n + 1))
prob = [binom.pmf(x, n, p) for x in x_values ]
cumult = 0
index_count = 0
for px in prob:
    cumult += px
    print(cumult)
    if cumult > 0.1:
        print(index_count - 1)
        break
    else:
        index_count = index_count + 1
plt.bar(x_values,prob)
plt.axvline(x=6, color='red', linestyle='-', label='Grenze')
plt.axhline(y=0.1, color='green',linestyle='--',label='Signifikanzniveau')
plt.legend()
plt.show()
Binomial distribution plot
So far so good. It looks exactly like it should. Now, for the power function, what I do is add up the individual probabilities from prob and, for each cumulative value, calculate the probability of failing the test. The graph for this should look something like this, for example:
Example Graph
(ofc as a bar chart in my case)
Yet, my code
p_values = []
err_p = []
cumul = 0
for p in prob:
    cumul = cumul + p
    p_values.append(cumul)
    err_p.append(1 - cumul)
x_pos = np.arange(len(p_values))
plt.bar(p_values, err_p)
plt.axvline(x=0.5, color='red', linestyle='-', label='p0')
plt.axhline(y=0.1, color='green',linestyle='--',label='Signifikanzniveau')
plt.legend()
plt.show()
Produces this weird bar chart
which has values below 0 and above 1 on the x-axis even though there are no such values in the data. I know it worked once, before I marked the values in this chart as well, but I haven't been able to reproduce that; I always get the version with non-existent values. I also don't know if it might have to do with the weirdly wide bars, since in the first graph they look normal but here they sort of flow into each other.
For your task, you don't want to use a bar plot but a step plot:
plt.step(x=p_values, y=err_p, where="mid", label="err")
plt.axvline(x=0.5, color='red', linestyle='-', label='p0')
plt.axhline(y=0.1, color='green',linestyle='--',label='Signifikanzniveau')
plt.legend()
plt.show()
Sample output:
Bars usually have a constant width, so they will leak into x-values that are not actually in your dataset. You could manually calculate the necessary width of each bar, but thankfully matplotlib has implemented the step function for exactly this task.
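If you did want to stick with bars, a minimal sketch of that manual-width idea (reusing the p_values and err_p lists from above) could look like this:
import numpy as np
import matplotlib.pyplot as plt

# Give each bar a width that spans exactly to the next x-value
widths = np.diff(p_values)
widths = np.append(widths, widths[-1])  # reuse the last width for the final bar

plt.bar(p_values, err_p, width=widths, align='edge')
plt.axvline(x=0.5, color='red', linestyle='-', label='p0')
plt.axhline(y=0.1, color='green', linestyle='--', label='Signifikanzniveau')
plt.legend()
plt.show()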
If you wanted a filled plot like a histogram, you could use fill_between:
plt.fill_between(x=p_values, y1=err_p, step="mid", color="lightblue", label="err")
plt.axvline(x=0.5, color='red', linestyle='-', label='p0')
plt.axhline(y=0.1, color='green',linestyle='--',label='Signifikanzniveau')
plt.legend()
plt.show()
Sample output:

How do I discretize a continuous function avoiding noise generation (see picture)

I have a continuous input function which I would like to discretize into, let's say, 5-10 discrete bins between 0 and 1. Right now I am using np.digitize and rescaling the output bins to 0-1. The problem is that sometimes datasets (blue line) yield results like this:
I tried pushing up the number of discretization bins, but I ended up keeping the same noise and just getting more increments. As an example where the algorithm worked with the same settings but another dataset:
This is the code I used, where NumOfDisc is the number of bins:
intervals = np.linspace(0,1,NumOfDisc)
discretized_Array = np.digitize(Continuous_Array, intervals)
The red line in the graph is not important. The continuous blue line is the one I try to discretize, and the green line is the discretized result. The graphs are created with matplotlib.pyplot using the following code:
import logging
import matplotlib.pyplot as plt

def CheckPlots(discretized_Array, Continuous_Array, Temperature, time, PlotName):
    logging.info("Plotting...")

    # Setting axis properties and titles
    fig, ax = plt.subplots(1, 1)
    ax.set_title(PlotName)
    ax.set_ylabel('Temperature [°C]')
    ax.set_ylim(40, 110)
    ax.set_xlabel('Time [s]')
    ax.grid(b=True, which="both")
    ax2 = ax.twinx()
    ax2.set_ylabel('DC Power [%]')
    ax2.set_ylim(-1.5, 3.5)

    # Plotting stuff
    ax.plot(time, Temperature, label="Input Temperature", color='#c70e04')
    ax2.plot(time, Continuous_Array, label="Continuous Power", color='#040ec7')
    ax2.plot(time, discretized_Array, label="Discrete Power", color='#539600')
    fig.legend(loc="upper left", bbox_to_anchor=(0, 1), bbox_transform=ax.transAxes)

    logging.info("Done!")
    logging.info("---")
    return
Any Ideas what I could do to get sensible discretizations like in the second case?
The following solution gives the exact result you need.
Basically, the algorithm finds an ideal line and attempts to replicate it as well as it can with fewer datapoints. It starts with 2 points at the edges (a straight line), then adds one in the center, then checks which side has the greatest error and adds a point in the center of that side, and so on, until it reaches the desired bin count. Simple :)
import warnings
import numpy as np

warnings.simplefilter('ignore', np.RankWarning)

def line_error(x0, y0, x1, y1, ideal_line, integral_points=100):
    """Assume a straight line between (x0,y0)->(x1,y1). Then sample the ideal line multiple times and compute the distance."""
    straight_line = np.poly1d(np.polyfit([x0, x1], [y0, y1], 1))
    xs = np.linspace(x0, x1, num=integral_points)
    ys = straight_line(xs)
    perfect_ys = ideal_line(xs)
    err = np.abs(ys - perfect_ys).sum() / integral_points * (x1 - x0)  # Remove (x1 - x0) to only look at avg errors
    return err

def discretize_bisect(xs, ys, bin_count):
    """Returns xs and ys of discrete points"""
    # For a large number of datapoints, without loss of generality you can treat xs and ys as bin edges
    # If that gives bad results, you can compute the edges in other ways, e.g. with np.histogram_bin_edges
    ideal_line = np.poly1d(np.polyfit(xs, ys, 50))
    new_xs = [xs[0], xs[-1]]
    new_ys = [ys[0], ys[-1]]
    while len(new_xs) < bin_count:
        errors = []
        for i in range(len(new_xs) - 1):
            err = line_error(new_xs[i], new_ys[i], new_xs[i+1], new_ys[i+1], ideal_line)
            errors.append(err)
        max_segment_id = np.argmax(errors)
        new_x = (new_xs[max_segment_id] + new_xs[max_segment_id+1]) / 2
        new_y = ideal_line(new_x)
        new_xs.insert(max_segment_id+1, new_x)
        new_ys.insert(max_segment_id+1, new_y)
    return new_xs, new_ys

BIN_COUNT = 25
new_xs, new_ys = discretize_bisect(xs, ys, BIN_COUNT)
plot_graph(xs, ys, new_xs, new_ys, f"Discretized and Continuous comparison, N(cont) = {N_MOCK}, N(disc) = {BIN_COUNT}")
print("Bin count:", len(new_xs))
Moreover, here's the simplified plotting function I tested with:
def plot_graph(cont_time, cont_array, disc_time, disc_array, plot_name):
    """A simplified version of the provided plotting function"""
    # Setting axis properties and titles
    fig, ax = plt.subplots(figsize=(20, 4))
    ax.set_title(plot_name)
    ax.set_xlabel('Time [s]')
    ax.set_ylabel('DC Power [%]')
    # Plotting stuff
    ax.plot(cont_time, cont_array, label="Continuous Power", color='#0000ff')
    ax.plot(disc_time, disc_array, label="Discrete Power", color='#00ff00')
    fig.legend(loc="upper left", bbox_to_anchor=(0, 1), bbox_transform=ax.transAxes)
Lastly, here's the Google Colab
If what I described in the comments is the problem, there are a few options to deal with this:
Do nothing: depending on the reason you're discretizing, you might actually want the discrete values to reflect the continuous values accurately.
Change the bins: you could shift the bins or change the number of bins, so that relatively 'flat' parts of the blue line stay within one bin, giving a flat green line in those parts as well, which would be visually more pleasing, as in your second plot. A sketch of this idea follows below.
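A minimal sketch of the bin-shifting idea, using np.digitize with offset edges (the Continuous_Array here is only a synthetic stand-in for your data, and the half-bin offset is just one possible shift):
import numpy as np

# Synthetic stand-in for Continuous_Array: a slowly drifting signal in [0, 1]
rng = np.random.default_rng(0)
Continuous_Array = np.clip(0.5 + np.cumsum(rng.normal(0, 0.01, 500)), 0, 1)

NumOfDisc = 7
even_edges = np.linspace(0, 1, NumOfDisc)

# Shift the edges by half a bin width so values hovering near an even edge
# stop flipping back and forth between two neighbouring bins
half_width = 0.5 / (NumOfDisc - 1)
shifted_edges = even_edges + half_width

discretized_even = np.digitize(Continuous_Array, even_edges)
discretized_shifted = np.digitize(Continuous_Array, shifted_edges)
Whether the even or the shifted edges give the flatter result depends on where your data happens to sit, so it is worth comparing both.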

Python: how to fit a gamma distribution from data?

I have a dataset and I am trying to see which distribution fits it best.
In my first attempt I tried to fit it with a Rayleigh distribution, so:
y, x = np.histogram(data, bins=45, normed=True)
param = rayleigh.fit(y) # distribution fitting
# fitted distribution
xx = linspace(0,45,1000)
pdf_fitted = rayleigh.pdf(xx,loc=param[0],scale=param[1])
pdf = rayleigh.pdf(xx,loc=0,scale=8.5)
fig,ax = plt.subplots(figsize=(7,5))
plot(xx,pdf,'r-', lw=5, alpha=0.6, label='rayleigh pdf')
plot(xx,pdf,'k-', label='Data')
plt.bar(x[1:], y)
ax.set_xlabel('Distance, '+r'$x [km]$',size = 15)
ax.set_ylabel('Frequency, '+r'$P(x)$',size=15)
ax.legend(loc='best', frameon=False)
I am trying to do the same with a gamma distribution, without succeeding:
y, x = np.histogram(net1['distance'], bins=45, normed=True)
xx = linspace(0,45,1000)
ag,bg,cg = gamma.fit(y)
pdf_gamma = gamma.pdf(xx, ag, bg,cg)
fig,ax = plt.subplots(figsize=(7,5))
# fitted distribution
plot(xx,pdf_gamma,'r-', lw=5, alpha=0.6, label='gamma pdf')
plot(xx,pdf_gamma,'k-')
plt.bar(x[1:], y, label='Data')
ax.set_xlabel('Distance, '+r'$x [km]$',size = 15)
ax.set_ylabel('Frequency, '+r'$P(x)$',size=15)
ax.legend(loc='best', frameon=False)
Unfortunately scipy.stats.gamma is not well documented.
Suppose you have some "raw" data in the form data = array([a1, a2, a3, .....]); these could be the results of one of your experiments.
You can give these raw values to the fit method: gamma.fit(data), and it will return three parameters for you: a, b, c = gamma.fit(data). These are the "shape", the "loc"ation and the "scale" of the gamma curve that best fits the DISTRIBUTION HISTOGRAM of your data (not the actual data points).
I noticed from the questions online that many people get this confused: they have a distribution (a histogram) of their data and try to fit that with gamma.fit. This is wrong.
The gamma.fit method expects your raw data, not the distribution of your data.
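For illustration, a minimal sketch with synthetic raw data (the shape/loc/scale values are arbitrary) showing the intended workflow:
import numpy as np
from scipy.stats import gamma
import matplotlib.pyplot as plt

# Synthetic "raw" data standing in for your measurements
data = gamma.rvs(a=2.0, loc=0, scale=3.0, size=5000, random_state=0)

# Fit on the raw samples, not on histogram counts
shape, loc, scale = gamma.fit(data)

# Compare the fitted pdf against a normalised histogram of the raw data
xx = np.linspace(data.min(), data.max(), 1000)
plt.hist(data, bins=45, density=True, alpha=0.5, label='Data')
plt.plot(xx, gamma.pdf(xx, shape, loc=loc, scale=scale), 'r-', lw=3, label='gamma fit')
plt.legend()
plt.show()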
Hopefully this will solve the problem for a few of us.
GR
My guess is that you have a lot of the original data at 0, so the alpha (shape) of the fit ends up lower than 1 (0.34) and you get the decreasing shape with a singularity at 0. The bar plot does not include the zero bin (x[1:]), so you don't see the huge bar on the left.
Could that be right?
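To see why a shape parameter below 1 produces that decreasing curve, here is a small sketch comparing a few shape values (the scale of 8.5 is borrowed from your Rayleigh attempt and otherwise arbitrary):
import numpy as np
from scipy.stats import gamma
import matplotlib.pyplot as plt

xx = np.linspace(0.01, 45, 1000)
for a in (0.34, 1.0, 2.0):
    # For a < 1 the pdf diverges as x -> 0 and then decreases monotonically
    plt.plot(xx, gamma.pdf(xx, a, loc=0, scale=8.5), label=f'shape a = {a}')
plt.legend()
plt.show()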

How to compute residuals of a point process in python

I am trying to reproduce the work from http://jheusser.github.io/2013/09/08/hawkes.html in python except with different data. I have written code to simulate a Poisson process as well as the Hawkes process they describe.
To do the Hawkes model MLE, I define the log-likelihood function as:
import math
import numpy as np

def loglikelihood(params, data):
    (mu, alpha, beta) = params
    tlist = np.array(data)
    r = np.zeros(len(tlist))
    for i in range(1, len(tlist)):
        r[i] = math.exp(-beta*(tlist[i]-tlist[i-1]))*(1+r[i-1])
    loglik = -tlist[-1]*mu
    loglik = loglik + alpha/beta*sum(np.exp(-beta*(tlist[-1]-tlist))-1)
    loglik = loglik + np.sum(np.log(mu+alpha*r))
    return -loglik
Using some dummy data, we can compute the MLE for the Hawkes process with
from scipy.optimize import minimize

atimes = [58.98353497, 59.28420225, 59.71571013, 60.06750179, 61.24794134,
          61.70692463, 61.73611983, 62.28593814, 62.51691723, 63.17370423,
          63.20125152, 65.34092403, 214.24934446, 217.0390236, 312.18830525,
          319.38385604, 320.31758188, 323.50201334, 323.76801537, 323.9417007]

res = minimize(loglikelihood, (0.01, 0.1, 0.1), method='Nelder-Mead', args=(atimes,))
print(res)
However, I don't know how to do the following things in Python:
How can I do the equivalent of evalCIF to get a fitted-versus-empirical intensities plot similar to theirs?
How can I compute the residuals for the Hawkes model to make the equivalent of their QQ plot? They say they use an R package called ptproc, but I can't find a Python equivalent.
OK, so the first thing you may wish to do is plot the data. To keep it simple I've reproduced this figure, as it only has 8 events occurring, so it's easy to see the behaviour of the system. The following code:
import numpy as np
import math, matplotlib
import matplotlib.pyplot
import matplotlib.lines
mu = 0.1 # Parameter values as found in the article http://jheusser.github.io/2013/09/08/hawkes.html Hawkes Process section.
alpha = 1.0
beta = 0.5
EventTimes = np.array([0.7, 1.2, 2.0, 3.8, 7.1, 8.2, 8.9, 9.0])
" Compute conditional intensities for all times using the Hawkes process. "
timesOfInterest = np.linspace(0.0, 10.0, 100) # Times where the intensity will be sampled.
conditionalIntensities = [] # Conditional intensity for every epoch of interest.
for t in timesOfInterest:
    conditionalIntensities.append( mu + np.array( [alpha*math.exp(-beta*(t-ti)) if t > ti else 0.0 for ti in EventTimes] ).sum() ) # Find the contributions of all preceding events to the overall chance of another one occurring. All events that occur after t have no contribution.
" Plot the conditional intensity time history. "
fig = matplotlib.pyplot.figure()
ax = fig.gca()
labelsFontSize = 16
ticksFontSize = 14
fig.suptitle(r"$Conditional\ intensity\ VS\ time$", fontsize=20)
ax.grid(True)
ax.set_xlabel(r'$Time$',fontsize=labelsFontSize)
ax.set_ylabel(r'$\lambda$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize)
matplotlib.rc('ytick', labelsize=ticksFontSize)
eventsScatter = ax.scatter(EventTimes,np.ones(len(EventTimes))) # Just to indicate where the events took place.
ax.plot(timesOfInterest, conditionalIntensities, color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
fittedPlot = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
fig.legend([fittedPlot, eventsScatter], [r'$Conditional\ intensity\ computed\ from\ events$', r'$Events$'])
matplotlib.pyplot.show()
reproduces the figure pretty accurately, even though I've chosen the event epochs somewhat arbitrarily:
This can also be applied to the example dataset of 5000 trades by binning the data and treating every bin as an event. However, what happens now is that every event has a slightly different weight, since a different number of trades occurs in every bin.
This is also mentioned in the article, in the Fitting Bitcoin Trade Arrival to a Hawkes Process section, together with a proposed way to overcome this problem: the only difference to the original dataset is that a random millisecond timestamp is added to all trades that share a timestamp with another trade. This is required because the model needs to distinguish every trade (i.e. every trade must have a unique timestamp). This is incorporated in the following code:
import numpy as np
import math, matplotlib, pandas
import scipy.optimize
import matplotlib.pyplot
import matplotlib.lines
" Read example trades' data. "
all_trades = pandas.read_csv('all_trades.csv', parse_dates=[0], index_col=0) # All trades' data.
all_counts = pandas.DataFrame({'counts': np.ones(len(all_trades))}, index=all_trades.index) # Only the count of the trades is really important.
empirical_1min = all_counts.resample('1min', how='sum') # Bin the data so find the number of trades in 1 minute intervals.
baseEventTimes = np.array( range(len(empirical_1min.values)), dtype=np.float64) # Dummy times when the events take place, don't care too much about actual epochs where the bins are placed - this could be scaled to days since epoch, second since epoch and any other measure of time.
eventTimes = [] # With the event batches split into separate events.
for i in range(len(empirical_1min.values)): # Deal with many events occurring at the same time - need to distinguish between them by splitting each batch of events into distinct events taking place at almost the same time.
    if not np.isnan(empirical_1min.values[i]):
        for j in range(int(empirical_1min.values[i])):
            eventTimes.append(baseEventTimes[i]+0.000001*(j+1)) # For every event that occurs at this epoch, enter a dummy event very close to it in time that will increase the conditional intensity.
eventTimes = np.array( eventTimes, dtype=np.float64 ) # Change to array for ease of operations.
" Find a fit for alpha, beta, and mu that minimises loglikelihood for the input data. "
#res = scipy.optimize.minimize(loglikelihood, (0.01, 0.1,0.1), method='Nelder-Mead', args = (eventTimes,))
#(mu, alpha, beta) = res.x
mu = 0.07 # Parameter values as found in the article.
alpha = 1.18
beta = 1.79
" Compute conditional intensities for all epochs using the Hawkes process - add more points to see how the effect of individual events decays over time. "
conditionalIntensitiesPlotting = [] # Conditional intensity for every epoch of interest.
timesOfInterest = np.linspace(eventTimes.min(), eventTimes.max(), eventTimes.size*10) # Times where the intensity will be sampled. Sample at much higher frequency than the events occur at.
for t in timesOfInterest:
    conditionalIntensitiesPlotting.append( mu + np.array( [alpha*math.exp(-beta*(t-ti)) if t > ti else 0.0 for ti in eventTimes] ).sum() ) # Find the contributions of all preceding events to the overall chance of another one occurring. All events that occur after time of interest t have no contribution.
" Compute conditional intensities at the same epochs as the empirical data are known. "
conditionalIntensities=[] # This will be used in the QQ plot later, has to have the same size as the empirical data.
for t in np.linspace(eventTimes.min(), eventTimes.max(), eventTimes.size):
    conditionalIntensities.append( mu + np.array( [alpha*math.exp(-beta*(t-ti)) if t > ti else 0.0 for ti in eventTimes] ).sum() ) # Use eventTimes here as well to feel the influence of all the events that happen at the same time.
" Plot the empirical and fitted datasets. "
fig = matplotlib.pyplot.figure()
ax = fig.gca()
labelsFontSize = 16
ticksFontSize = 14
fig.suptitle(r"$Conditional\ intensity\ VS\ time$", fontsize=20)
ax.grid(True)
ax.set_xlabel(r'$Time$',fontsize=labelsFontSize)
ax.set_ylabel(r'$\lambda$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize)
matplotlib.rc('ytick', labelsize=ticksFontSize)
# Plot the empirical binned data.
ax.plot(baseEventTimes,empirical_1min.values, color='blue', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
empiricalPlot = matplotlib.lines.Line2D([],[],color='blue', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
# And the fit obtained using the Hawkes function.
ax.plot(timesOfInterest, conditionalIntensitiesPlotting, color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
fittedPlot = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
fig.legend([fittedPlot, empiricalPlot], [r'$Fitted\ data$', r'$Empirical\ data$'])
matplotlib.pyplot.show()
This generates the following fit to the plot:
All looking good, but when you look at the detail you'll see that computing the residuals by simply taking the vector of trade counts and subtracting the fitted one won't do, since they have different lengths:
It is possible, however, to extract the intensity at the same epochs at which the empirical data were recorded and then compute the residuals. This enables you to find quantiles of both the empirical and the fitted data and plot them against each other, thus generating the QQ plot:
""" GENERATE THE QQ PLOT. """
" Process the data and compute the quantiles. "
orderStatistics=[]; orderStatistics2=[];
for i in range( empirical_1min.values.size ): # Make sure all the NaNs are filtered out and both arrays have the same size.
    if not np.isnan( empirical_1min.values[i] ):
        orderStatistics.append(empirical_1min.values[i])
        orderStatistics2.append(conditionalIntensities[i])
orderStatistics = np.array(orderStatistics); orderStatistics2 = np.array(orderStatistics2);
orderStatistics.sort(axis=0) # Need to sort data in ascending order to make a QQ plot. orderStatistics is a column vector.
orderStatistics2.sort()
smapleQuantiles=np.zeros( orderStatistics.size ) # Quantiles of the empirical data.
smapleQuantiles2=np.zeros( orderStatistics2.size ) # Quantiles of the data fitted using the Hawkes process.
for i in range( orderStatistics.size ):
    temp = int( 100*(i-0.5)/float(smapleQuantiles.size) ) # (i-0.5)/float(smapleQuantiles.size) th quantile. Convert to % as expected by the numpy function.
    if temp < 0.0:
        temp = 0.0 # Avoid having -ve percentiles.
    smapleQuantiles[i] = np.percentile(orderStatistics, temp)
    smapleQuantiles2[i] = np.percentile(orderStatistics2, temp)
" Make the quantile plot of empirical data first. "
fig2 = matplotlib.pyplot.figure()
ax2 = fig2.gca(aspect="equal")
fig2.suptitle(r"$Quantile\ plot$", fontsize=20)
ax2.grid(True)
ax2.set_xlabel(r'$Sample\ fraction\ (\%)$',fontsize=labelsFontSize)
ax2.set_ylabel(r'$Observations$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize)
matplotlib.rc('ytick', labelsize=ticksFontSize)
distScatter = ax2.scatter(smapleQuantiles, orderStatistics, c='blue', marker='o') # If these are close to the straight line with slope line these points come from a normal distribution.
ax2.plot(smapleQuantiles, smapleQuantiles, color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)
normalDistPlot = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)
fig2.legend([normalDistPlot, distScatter], [r'$Normal\ distribution$', r'$Empirical\ data$'])
matplotlib.pyplot.show()
" Make a QQ plot. "
fig3 = matplotlib.pyplot.figure()
ax3 = fig3.gca(aspect="equal")
fig3.suptitle(r"$Quantile\ -\ Quantile\ plot$", fontsize=20)
ax3.grid(True)
ax3.set_xlabel(r'$Empirical\ data$',fontsize=labelsFontSize)
ax3.set_ylabel(r'$Data\ fitted\ with\ Hawkes\ distribution$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize)
matplotlib.rc('ytick', labelsize=ticksFontSize)
distributionScatter = ax3.scatter(smapleQuantiles, smapleQuantiles2, c='blue', marker='x') # If these are close to the straight line with slope line these points come from a normal distribution.
ax3.plot(smapleQuantiles, smapleQuantiles, color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)
normalDistPlot2 = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)
fig3.legend([normalDistPlot2, distributionScatter], [r'$Normal\ distribution$', r'$Comparison\ of\ datasets$'])
matplotlib.pyplot.show()
This generates the following plots:
The quantile plot of the empirical data isn't exactly the same as in the article; I'm not sure why, as I'm not great with statistics. But, from a programming standpoint, this is how you can go about all of this.
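On the residuals question specifically, another common approach (not the one used above, so treat this as a sketch) is the time-rescaling theorem: pass the event times through the fitted compensator Lambda(t), after which the inter-arrival times of the transformed process should be i.i.d. Exponential(1) if the model is right. Assuming the fitted mu, alpha, beta from above and reusing the atimes list from the question (the helper name hawkes_compensator is just for illustration):
import numpy as np
import matplotlib.pyplot as plt

def hawkes_compensator(t, event_times, mu, alpha, beta):
    """Integrated intensity Lambda(t) of an exponential-kernel Hawkes process."""
    past = event_times[event_times < t]
    return mu * t + (alpha / beta) * np.sum(1.0 - np.exp(-beta * (t - past)))

# Assumed fitted parameters (e.g. from the MLE above)
mu, alpha, beta = 0.07, 1.18, 1.79
event_times = np.sort(np.asarray(atimes))

# Residuals: inter-arrival times of the time-rescaled process.
# Under a correctly specified model they should be i.i.d. Exponential(1).
Lambdas = np.array([hawkes_compensator(t, event_times, mu, alpha, beta) for t in event_times])
residuals = np.diff(Lambdas)

# QQ plot of the residuals against Exponential(1) quantiles
probs = (np.arange(1, residuals.size + 1) - 0.5) / residuals.size
exp_quantiles = -np.log(1.0 - probs)
plt.scatter(exp_quantiles, np.sort(residuals), marker='x')
plt.plot(exp_quantiles, exp_quantiles, 'r-')
plt.xlabel('Exponential(1) quantiles')
plt.ylabel('Empirical residual quantiles')
plt.show()
Points far from the diagonal then indicate where the fitted model fails to explain the observed clustering.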

Python 2d-histogram success rate per bin

I have data in arrays x, y and w where 'x' and 'y' indicate position and 'w' is a weight of either 1 or 0 indicating success or failure. I'm trying to create a 2d histogram where each bin of the histogram is coloured based on the percentage of successes in that bin (i.e. # of successes in bin divided by total points in bin). I've played around with numpy.histogram2d quite a bit and can get density plots going, but this is not the same as the % of success scheme I'm aiming for. Please note normed=True in the numpy.histogram2d argument does not alleviate this problem.
(To clarify on the difference, a density plot would indicate large 'colour value' if there is a larger number of successes in the bin regardless of how many failures are in the same bin. I would like to have the percentage of successes instead, so a large number of failures in the same bin would give a smaller 'colour value'. I apologise for poor terminology).
Thank you very much to anyone who can help!
Example of current code that doesn't do what I'm aiming for:
import matplotlib.pyplot as plt
import numpy as np
plt.figure(1)
H, xedges, yedges = np.histogram2d(x, y, bins=50, weights=w, normed=True)
extent = [yedges[0], yedges[-1], xedges[-1], xedges[0]]
plt.imshow(H, extent=extent,interpolation='nearest')
plt.colorbar()
plt.show()
I'm pretty sure this works, but you don't give data, so it's hard to check. normed=True gives you densities; if you don't pass normed=True, you get weighted sample counts. So if you divide your weighted version (which is really just the number of successes per bin) by the unweighted one (the number of elements in each bin), you'll end up with the percentage of successes.
import matplotlib.pyplot as plt
import numpy as np
plt.figure(1)
H, xedges, yedges = np.histogram2d(x, y, bins=50, weights=w)
H2, _, _ = np.histogram2d(x,y, bins=50)
extent = [0,1, xedges[-1], xedges[0]]
plt.imshow(H/H2, extent=extent,interpolation='nearest')
plt.colorbar()
plt.show()
This could leave NaNs in the final histogram (for bins that contain no points at all), so you might want to decide how to handle those cases.
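One option, as a sketch (assuming the H, H2, extent and plt from the snippet above), is to divide only where a bin actually contains points and mark empty bins as NaN, which imshow then leaves blank:
import numpy as np

# Divide only where the denominator is non-zero; empty bins become NaN,
# which imshow renders as blank cells by default.
ratio = np.divide(H, H2, out=np.full_like(H, np.nan), where=H2 > 0)

plt.imshow(ratio, extent=extent, interpolation='nearest')
plt.colorbar()
plt.show()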
