How to compute residuals of a point process in python - python

I am trying to reproduce the work from http://jheusser.github.io/2013/09/08/hawkes.html in python except with different data. I have written code to simulate a Poisson process as well as the Hawkes process they describe.
To do the Hawkes model MLE I define the log likelihood function as
def loglikelihood(params, data):
(mu, alpha, beta) = params
tlist = np.array(data)
r = np.zeros(len(tlist))
for i in xrange(1,len(tlist)):
r[i] = math.exp(-beta*(tlist[i]-tlist[i-1]))*(1+r[i-1])
loglik = -tlist[-1]*mu
loglik = loglik+alpha/beta*sum(np.exp(-beta*(tlist[-1]-tlist))-1)
loglik = loglik+np.sum(np.log(mu+alpha*r))
return -loglik
Using some dummy data, we can compute the MLE for the Hawkes process with
atimes=[58.98353497, 59.28420225, 59.71571013, 60.06750179, 61.24794134,
61.70692463, 61.73611983, 62.28593814, 62.51691723, 63.17370423
,63.20125152, 65.34092403, 214.24934446, 217.0390236, 312.18830525,
319.38385604, 320.31758188, 323.50201334, 323.76801537, 323.9417007]
res = minimize(loglikelihood, (0.01, 0.1,0.1),method='Nelder-Mead',args = (atimes,))
print res
However, I don't know how to do the following things in python.
How can I do the equivalent of evalCIF to get a similar fitted versus empirical intensities plot as they have?
How can I compute the residuals for the Hawkes model to make the equivalent of the QQ plot they have. They say they use an R package called ptproc but I can't find a python equivalent.

OK, so first thing that you may wish to do is to plot the data. To keep it simple I've reproduced this figure as it only has 8 events occurring so it's easy to see the behaviour of the system. The following code:
import numpy as np
import math, matplotlib
import matplotlib.pyplot
import matplotlib.lines
mu = 0.1 # Parameter values as found in the article http://jheusser.github.io/2013/09/08/hawkes.html Hawkes Process section.
alpha = 1.0
beta = 0.5
EventTimes = np.array([0.7, 1.2, 2.0, 3.8, 7.1, 8.2, 8.9, 9.0])
" Compute conditional intensities for all times using the Hawkes process. "
timesOfInterest = np.linspace(0.0, 10.0, 100) # Times where the intensity will be sampled.
conditionalIntensities = [] # Conditional intensity for every epoch of interest.
for t in timesOfInterest:
conditionalIntensities.append( mu + np.array( [alpha*math.exp(-beta*(t-ti)) if t > ti else 0.0 for ti in EventTimes] ).sum() ) # Find the contributions of all preceding events to the overall chance of another one occurring. All events that occur after t have no contribution.
" Plot the conditional intensity time history. "
fig = matplotlib.pyplot.figure()
ax = fig.gca()
labelsFontSize = 16
ticksFontSize = 14
fig.suptitle(r"$Conditional\ intensity\ VS\ time$", fontsize=20)
ax.grid(True)
ax.set_xlabel(r'$Time$',fontsize=labelsFontSize)
ax.set_ylabel(r'$\lambda$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize)
matplotlib.rc('ytick', labelsize=ticksFontSize)
eventsScatter = ax.scatter(EventTimes,np.ones(len(EventTimes))) # Just to indicate where the events took place.
ax.plot(timesOfInterest, conditionalIntensities, color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
fittedPlot = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
fig.legend([fittedPlot, eventsScatter], [r'$Conditional\ intensity\ computed\ from\ events$', r'$Events$'])
matplotlib.pyplot.show()
reproduces the figure pretty accurately, even though I've chosen the event epochs somewhat arbitrarily:
This can also be applied to the set of example set of data of 5000 trades by binning the data and treating every bin as an event. However, what happens now, every event has a slightly different weight as different number of trades occurs in every bin.
This is also mentioned in the article in Fitting Bitcoin Trade Arrival to a Hawkes Process section with a proposed way to overcome this problem: The only difference to the original dataset is that I added a random millisecond timestamp to all trades that share a timestamp with another trade. This is required as the model requires to distinguish every trade (i.e. every trade must have a unique timestamp). This is incorporated in the following code:
import numpy as np
import math, matplotlib, pandas
import scipy.optimize
import matplotlib.pyplot
import matplotlib.lines
" Read example trades' data. "
all_trades = pandas.read_csv('all_trades.csv', parse_dates=[0], index_col=0) # All trades' data.
all_counts = pandas.DataFrame({'counts': np.ones(len(all_trades))}, index=all_trades.index) # Only the count of the trades is really important.
empirical_1min = all_counts.resample('1min', how='sum') # Bin the data so find the number of trades in 1 minute intervals.
baseEventTimes = np.array( range(len(empirical_1min.values)), dtype=np.float64) # Dummy times when the events take place, don't care too much about actual epochs where the bins are placed - this could be scaled to days since epoch, second since epoch and any other measure of time.
eventTimes = [] # With the event batches split into separate events.
for i in range(len(empirical_1min.values)): # Deal with many events occurring at the same time - need to distinguish between them by splitting each batch of events into distinct events taking place at almost the same time.
if not np.isnan(empirical_1min.values[i]):
for j in range(empirical_1min.values[i]):
eventTimes.append(baseEventTimes[i]+0.000001*(j+1)) # For every event that occurrs at this epoch enter a dummy event very close to it in time that will increase the conditional intensity.
eventTimes = np.array( eventTimes, dtype=np.float64 ) # Change to array for ease of operations.
" Find a fit for alpha, beta, and mu that minimises loglikelihood for the input data. "
#res = scipy.optimize.minimize(loglikelihood, (0.01, 0.1,0.1), method='Nelder-Mead', args = (eventTimes,))
#(mu, alpha, beta) = res.x
mu = 0.07 # Parameter values as found in the article.
alpha = 1.18
beta = 1.79
" Compute conditional intensities for all epochs using the Hawkes process - add more points to see how the effect of individual events decays over time. "
conditionalIntensitiesPlotting = [] # Conditional intensity for every epoch of interest.
timesOfInterest = np.linspace(eventTimes.min(), eventTimes.max(), eventTimes.size*10) # Times where the intensity will be sampled. Sample at much higher frequency than the events occur at.
for t in timesOfInterest:
conditionalIntensitiesPlotting.append( mu + np.array( [alpha*math.exp(-beta*(t-ti)) if t > ti else 0.0 for ti in eventTimes] ).sum() ) # Find the contributions of all preceding events to the overall chance of another one occurring. All events that occur after time of interest t have no contribution.
" Compute conditional intensities at the same epochs as the empirical data are known. "
conditionalIntensities=[] # This will be used in the QQ plot later, has to have the same size as the empirical data.
for t in np.linspace(eventTimes.min(), eventTimes.max(), eventTimes.size):
conditionalIntensities.append( mu + np.array( [alpha*math.exp(-beta*(t-ti)) if t > ti else 0.0 for ti in eventTimes] ).sum() ) # Use eventTimes here as well to feel the influence of all the events that happen at the same time.
" Plot the empirical and fitted datasets. "
fig = matplotlib.pyplot.figure()
ax = fig.gca()
labelsFontSize = 16
ticksFontSize = 14
fig.suptitle(r"$Conditional\ intensity\ VS\ time$", fontsize=20)
ax.grid(True)
ax.set_xlabel(r'$Time$',fontsize=labelsFontSize)
ax.set_ylabel(r'$\lambda$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize)
matplotlib.rc('ytick', labelsize=ticksFontSize)
# Plot the empirical binned data.
ax.plot(baseEventTimes,empirical_1min.values, color='blue', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
empiricalPlot = matplotlib.lines.Line2D([],[],color='blue', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
# And the fit obtained using the Hawkes function.
ax.plot(timesOfInterest, conditionalIntensitiesPlotting, color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
fittedPlot = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
fig.legend([fittedPlot, empiricalPlot], [r'$Fitted\ data$', r'$Empirical\ data$'])
matplotlib.pyplot.show()
This generates the following fit to the plot:
All looking good but, when you look at the detail, you'll see that computing the residuals by simply taking one vector of the number of trades and subtracting the fitted one won't do since they have different lengths:
It is possible, however, to extract the intensity at the same epochs as when it was recorded for the empirical data and then compute the residuals. This enables you to find quantiles of both empirical and fitted data and plot them against each other thus generating the QQ plot:
""" GENERATE THE QQ PLOT. """
" Process the data and compute the quantiles. "
orderStatistics=[]; orderStatistics2=[];
for i in range( empirical_1min.values.size ): # Make sure all the NANs are filtered out and both arrays have the same size.
if not np.isnan( empirical_1min.values[i] ):
orderStatistics.append(empirical_1min.values[i])
orderStatistics2.append(conditionalIntensities[i])
orderStatistics = np.array(orderStatistics); orderStatistics2 = np.array(orderStatistics2);
orderStatistics.sort(axis=0) # Need to sort data in ascending order to make a QQ plot. orderStatistics is a column vector.
orderStatistics2.sort()
smapleQuantiles=np.zeros( orderStatistics.size ) # Quantiles of the empirical data.
smapleQuantiles2=np.zeros( orderStatistics2.size ) # Quantiles of the data fitted using the Hawkes process.
for i in range( orderStatistics.size ):
temp = int( 100*(i-0.5)/float(smapleQuantiles.size) ) # (i-0.5)/float(smapleQuantiles.size) th quantile. COnvert to % as expected by the numpy function.
if temp<0.0:
temp=0.0 # Avoid having -ve percentiles.
smapleQuantiles[i] = np.percentile(orderStatistics, temp)
smapleQuantiles2[i] = np.percentile(orderStatistics2, temp)
" Make the quantile plot of empirical data first. "
fig2 = matplotlib.pyplot.figure()
ax2 = fig2.gca(aspect="equal")
fig2.suptitle(r"$Quantile\ plot$", fontsize=20)
ax2.grid(True)
ax2.set_xlabel(r'$Sample\ fraction\ (\%)$',fontsize=labelsFontSize)
ax2.set_ylabel(r'$Observations$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize)
matplotlib.rc('ytick', labelsize=ticksFontSize)
distScatter = ax2.scatter(smapleQuantiles, orderStatistics, c='blue', marker='o') # If these are close to the straight line with slope line these points come from a normal distribution.
ax2.plot(smapleQuantiles, smapleQuantiles, color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)
normalDistPlot = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)
fig2.legend([normalDistPlot, distScatter], [r'$Normal\ distribution$', r'$Empirical\ data$'])
matplotlib.pyplot.show()
" Make a QQ plot. "
fig3 = matplotlib.pyplot.figure()
ax3 = fig3.gca(aspect="equal")
fig3.suptitle(r"$Quantile\ -\ Quantile\ plot$", fontsize=20)
ax3.grid(True)
ax3.set_xlabel(r'$Empirical\ data$',fontsize=labelsFontSize)
ax3.set_ylabel(r'$Data\ fitted\ with\ Hawkes\ distribution$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize)
matplotlib.rc('ytick', labelsize=ticksFontSize)
distributionScatter = ax3.scatter(smapleQuantiles, smapleQuantiles2, c='blue', marker='x') # If these are close to the straight line with slope line these points come from a normal distribution.
ax3.plot(smapleQuantiles, smapleQuantiles, color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)
normalDistPlot2 = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)
fig3.legend([normalDistPlot2, distributionScatter], [r'$Normal\ distribution$', r'$Comparison\ of\ datasets$'])
matplotlib.pyplot.show()
This generates the following plots:
The quantile plot of empirical data isn't exactly the same as in the article, I'm not sure why as I'm not great with statistics. But, from programming standpoint, this is how you can go about all this.

Related

Histogram line of best fit line is jagged and not smooth?

I can't quite seem to figue out how to get my curve to be displayed smoothly instead of having so many sharp turns.
I am hoping to show a boltzmann probability distribution. With a nice smooth curve.
I'll expect it is a simple fix but I can't see it. Can someone please help?
My code is below:
from matplotlib import pyplot as plt
import numpy as np
import scipy.stats
dE = 1
N = 500
n = 10000
# This is creating an array filled with all twos
def Create_Array(N):
Particle_State_List_set = np.ones(N, dtype = int)
Particle_State_List_twos = Particle_State_List_set + 1
return(Particle_State_List_twos)
Array = Create_Array(N)
def Select_Random_index(N):
Seed = np.random.default_rng()
Partcle_Index = Seed.integers(low=0, high= N - 1)
return(Partcle_Index)
def Exchange(N):
Particle_Index_A = Select_Random_index(N) #Selects a particle to be used as particle "a"
Particle_Index_B = Select_Random_index(N) #Selects a particle to be used as particle "b"
# Checks to see if the energy on particle "a" is zero, if so it selects anbother until it isn't.
while Array[Particle_Index_A] == 1:
Particle_Index_A = Select_Random_index(N)
#This loop is making sure that Particle "a" and "b" aren't the same particle, it chooses again until the are diffrent.
while Particle_Index_B == Particle_Index_A:
Particle_Index_B = Select_Random_index(N)
# This assignes variables to the chosen particle's energy values
a = Array[Particle_Index_A]
b = Array[Particle_Index_B]
# This updates the values of the Energy levels of the interacting particles
Array[Particle_Index_A] = a - dE
Array[Particle_Index_B] = b + dE
return (Array[Particle_Index_A], Array[Particle_Index_B])
for i in range(n):
Exchange(N)
# This part is making the histogram the curve will be made from
_, bins, _ = plt.hist(Array, 12, density=1, alpha=0.15, color="g")
# This is using scipy to find the mean and standard deviation in order to plot the curve
mean, std = scipy.stats.norm.fit(Array)
# This part is drawing the best fit line, using the established bins value and the std and mean from before
best_fit = scipy.stats.norm.pdf(bins, mean, std)
# Plotting the best fit curve
plt.plot(bins, best_fit, color="r", linewidth=2.5)
#These are instructions on how python with show the graph
plt.title("Boltzmann Probablitly Curve")
plt.xlabel("Energy Value")
plt.ylabel('Percentage at this Energy Value')
plt.tick_params(top=True, right=True)
plt.tick_params(direction='in', length=6, width=1, colors='0')
plt.grid()
plt.show()
Whats happening is that in these lines:
best_fit = scipy.stats.norm.pdf(bins, mean, std)
plt.plot(bins, best_fit, color="r", linewidth=2.5)
'bins' the histogram bin edges is being used as the x coordinates of the data points forming the best fit line. The resulting plot is jagged because they are so widely spaced. Instead you can define a tighter packed set of x coordinates and use that:
bfX = np.arange(bins[0],bins[-1],.05)
best_fit = scipy.stats.norm.pdf(bfX, mean, std)
plt.plot(bfX, best_fit, color="r", linewidth=2.5)
For me that gives a nice smooth curve, but you can always use a tighter packing than .05 if its not to your liking yet.

How do I discretize a continuous function avoiding noise generation (see picture)

I have a continuous input function which I would like to discretize into lets say 5-10 discrete bins between 1 and 0. Right now I am using np.digitize and rescale the output bins to 0-1. Now the problem is that sometime datasets (blue line) yield results like this:
I tried pushing up the number of discretization bins but I ended up keeping the same noise and getting just more increments. As an example where the algorithm worked with the same settings but another dataset:
this is the code I used there NumOfDisc = number of bins
intervals = np.linspace(0,1,NumOfDisc)
discretized_Array = np.digitize(Continuous_Array, intervals)
The red ilne in the graph is not important. The continuous blue line is the on I try to discretize and the green line is the discretized result.The Graphs are created with matplotlyib.pyplot using the following code:
def CheckPlots(discretized_Array, Continuous_Array, Temperature, time, PlotName)
logging.info("Plotting...")
#Setting Axis properties and titles
fig, ax = plt.subplots(1, 1)
ax.set_title(PlotName)
ax.set_ylabel('Temperature [°C]')
ax.set_ylim(40, 110)
ax.set_xlabel('Time [s]')
ax.grid(b=True, which="both")
ax2=ax.twinx()
ax2.set_ylabel('DC Power [%]')
ax2.set_ylim(-1.5,3.5)
#Plotting stuff
ax.plot(time, Temperature, label= "Input Temperature", color = '#c70e04')
ax2.plot(time, Continuous_Array, label= "Continuous Power", color = '#040ec7')
ax2.plot(time, discretized_Array, label= "Discrete Power", color = '#539600')
fig.legend(loc = "upper left", bbox_to_anchor=(0,1), bbox_transform=ax.transAxes)
logging.info("Done!")
logging.info("---")
return
Any Ideas what I could do to get sensible discretizations like in the second case?
The following solution gives the exact result you need.
Basically, the algorithm finds an ideal line, and attempts to replicate it as well as it can with less datapoints. It starts with 2 points at the edges (straight line), then adds one in the center, then checks which side has the greatest error, and adds a point in the center of that, and so on, until it reaches the desired bin count. Simple :)
import warnings
warnings.simplefilter('ignore', np.RankWarning)
def line_error(x0, y0, x1, y1, ideal_line, integral_points=100):
"""Assume a straight line between (x0,y0)->(x1,p1). Then sample the perfect line multiple times and compute the distance."""
straight_line = np.poly1d(np.polyfit([x0, x1], [y0, y1], 1))
xs = np.linspace(x0, x1, num=integral_points)
ys = straight_line(xs)
perfect_ys = ideal_line(xs)
err = np.abs(ys - perfect_ys).sum() / integral_points * (x1 - x0) # Remove (x1 - x0) to only look at avg errors
return err
def discretize_bisect(xs, ys, bin_count):
"""Returns xs and ys of discrete points"""
# For a large number of datapoints, without loss of generality you can treat xs and ys as bin edges
# If it gives bad results, you can edges in many ways, e.g. with np.polyline or np.histogram_bin_edges
ideal_line = np.poly1d(np.polyfit(xs, ys, 50))
new_xs = [xs[0], xs[-1]]
new_ys = [ys[0], ys[-1]]
while len(new_xs) < bin_count:
errors = []
for i in range(len(new_xs)-1):
err = line_error(new_xs[i], new_ys[i], new_xs[i+1], new_ys[i+1], ideal_line)
errors.append(err)
max_segment_id = np.argmax(errors)
new_x = (new_xs[max_segment_id] + new_xs[max_segment_id+1]) / 2
new_y = ideal_line(new_x)
new_xs.insert(max_segment_id+1, new_x)
new_ys.insert(max_segment_id+1, new_y)
return new_xs, new_ys
BIN_COUNT = 25
new_xs, new_ys = discretize_bisect(xs, ys, BIN_COUNT)
plot_graph(xs, ys, new_xs, new_ys, f"Discretized and Continuous comparison, N(cont) = {N_MOCK}, N(disc) = {BIN_COUNT}")
print("Bin count:", len(new_xs))
Moreover, here's my simplified plotting function I tested with.
def plot_graph(cont_time, cont_array, disc_time, disc_array, plot_name):
"""A simplified version of the provided plotting function"""
# Setting Axis properties and titles
fig, ax = plt.subplots(figsize=(20, 4))
ax.set_title(plot_name)
ax.set_xlabel('Time [s]')
ax.set_ylabel('DC Power [%]')
# Plotting stuff
ax.plot(cont_time, cont_array, label="Continuous Power", color='#0000ff')
ax.plot(disc_time, disc_array, label="Discrete Power", color='#00ff00')
fig.legend(loc="upper left", bbox_to_anchor=(0,1), bbox_transform=ax.transAxes)
Lastly, here's the Google Colab
If what I described in the comments is the problem, there are a few options to deal with this:
Do nothing: Depending on the reason you're discretizing, you might want the discrete values to reflect the continuous values accurately
Change the bins: you could shift the bins or change the number of bins, such that relatively 'flat' parts of the blue line stay within one bin, thus giving a flat green line in these parts as well, which would be visually more pleasing like in your second plot.

A mathematical explanation for why variance of bootstrap estimates decreases

I am trying to grok bootstrapping and bagging (bootstrap aggregation), so I've been attempting to perform some experiments. I loaded in a sample dataset from Kaggle and attempted to use the bootstrapping method:
X = pd.read_csv("dataset.csv")
true_median = np.median(X["Impressions"])
B = 500
errors = []
variances = []
for b in range(1, B):
sample_medians = [np.median(X.sample(len(X), replace=True)["Impressions"]) for i in range(b)]
error = np.mean(sample_medians) - true_median
variances.append(np.std(sample_medians) ** 2)
errors.append(error)
Then I visualized errors and variances:
fig, ax1 = plt.subplots()
color = 'tab:red'
ax1.set_xlabel('Number of Bootstrap Samples (B)')
ax1.set_ylabel('Bootstrap Estimate Error', color=color)
ax1.plot(errors, color=color, alpha=0.7)
ax1.tick_params(axis='y', labelcolor=color)
ax2 = ax1.twinx()
color = 'tab:blue'
ax2.set_ylabel('Bootstrap Estimate Variance', color=color)
ax2.plot(variances, color=color, alpha=0.7)
ax2.tick_params(axis='y', labelcolor=color)
fig.tight_layout()
plt.title("Relationship Between Bootstrap Error, Variance \nand Number of Bootstrap Iterations")
plt.show()
This is the output of the plot:
You can see that both the error and the variance decrease as B increases.
I'm trying to find some sort of mathematical justification - is there a way to derive or prove why the variance of bootstrap estimates decreases when B increases?
I think what you are seeing is Central-Limit Theorem in play. When the loop starts, the number of samples from the population with replacement is small and the mean of medians (you call this error) is not representative of reaching the true population median. As you generate more samples, the mean of medians converge to true median asymptotically. As the convergence happens towards true mean, the samples from this distribution are not far enough to generate a large variance and it also reaches convergence.
Did that clarify? If not please elaborate what you are expecting to see while plotting them and we can discuss about how to get there.

How can I plot ca. 20 million points as a scatterplot?

I am trying to create a scatterplot with matplotlib that consists of ca. ca. 20 million data points. Even after setting the alpha value to its lowest before ending up with no visible data at all the result is just a completely black plot.
plt.scatter(timedPlotData, plotData, alpha=0.01, marker='.')
The x-axis is a continuous timeline of about 2 months and the y-axis consists of 150k consecutive integer values.
Is there any way to plot all the points so that their distribution over time is still visible?
Thank you for your help.
There's more than one way to do this. A lot of folks have suggested a heatmap/kernel-density-estimate/2d-histogram. #Bucky suggesed using a moving average. In addition, you can fill between a moving min and moving max, and plot the moving mean over the top. I often call this a "chunkplot", but that's a terrible name. The implementation below assumes that your time (x) values are monotonically increasing. If they're not, it's simple enough to sort y by x before "chunking" in the chunkplot function.
Here are a couple of different ideas. Which is best will depend on what you want to emphasize in the plot. Note that this will be rather slow to run, but that's mostly due to the scatterplot. The other plotting styles are much faster.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
np.random.seed(1977)
def main():
x, y = generate_data()
fig, axes = plt.subplots(nrows=3, sharex=True)
for ax in axes.flat:
ax.xaxis_date()
fig.autofmt_xdate()
axes[0].set_title('Scatterplot of all data')
axes[0].scatter(x, y, marker='.')
axes[1].set_title('"Chunk" plot of data')
chunkplot(x, y, chunksize=1000, ax=axes[1],
edgecolor='none', alpha=0.5, color='gray')
axes[2].set_title('Hexbin plot of data')
axes[2].hexbin(x, y)
plt.show()
def generate_data():
# Generate a very noisy but interesting timeseries
x = mdates.drange(dt.datetime(2010, 1, 1), dt.datetime(2013, 9, 1),
dt.timedelta(minutes=10))
num = x.size
y = np.random.random(num) - 0.5
y.cumsum(out=y)
y += 0.5 * y.max() * np.random.random(num)
return x, y
def chunkplot(x, y, chunksize, ax=None, line_kwargs=None, **kwargs):
if ax is None:
ax = plt.gca()
if line_kwargs is None:
line_kwargs = {}
# Wrap the array into a 2D array of chunks, truncating the last chunk if
# chunksize isn't an even divisor of the total size.
# (This part won't use _any_ additional memory)
numchunks = y.size // chunksize
ychunks = y[:chunksize*numchunks].reshape((-1, chunksize))
xchunks = x[:chunksize*numchunks].reshape((-1, chunksize))
# Calculate the max, min, and means of chunksize-element chunks...
max_env = ychunks.max(axis=1)
min_env = ychunks.min(axis=1)
ycenters = ychunks.mean(axis=1)
xcenters = xchunks.mean(axis=1)
# Now plot the bounds and the mean...
fill = ax.fill_between(xcenters, min_env, max_env, **kwargs)
line = ax.plot(xcenters, ycenters, **line_kwargs)[0]
return fill, line
main()
For each day, tally up the frequency of each value (a collections.Counter will do this nicely), then plot a heatmap of the values, one per day. For publication, use a grayscale for the heatmap colors.
My recommendation would be to use a sorting and moving average algorithm on the raw data before you plot it. This should leave the mean and trend intact over the time period of interest while providing you with a reduction in clutter on the plot.
Group values into bands on each day and use a 3d histogram of count, value band, day.
That way you can get the number of occurrences in a given band on each day clearly.

Python 2d-histogram success rate per bin

I have data in arrays x, y and w where 'x' and 'y' indicate position and 'w' is a weight of either 1 or 0 indicating success or failure. I'm trying to create a 2d histogram where each bin of the histogram is coloured based on the percentage of successes in that bin (i.e. # of successes in bin divided by total points in bin). I've played around with numpy.histogram2d quite a bit and can get density plots going, but this is not the same as the % of success scheme I'm aiming for. Please note normed=True in the numpy.histogram2d argument does not alleviate this problem.
(To clarify on the difference, a density plot would indicate large 'colour value' if there is a larger number of successes in the bin regardless of how many failures are in the same bin. I would like to have the percentage of successes instead, so a large number of failures in the same bin would give a smaller 'colour value'. I apologise for poor terminology).
Thank you very much to anyone who can help!
Example of current code that doesn't do what I'm aiming for:
import matplotlib.pyplot as plt
import numpy as np
plt.figure(1)
H, xedges, yedges = np.histogram2d(x, y, bins=50, weights=w, normed=True)
extent = [yedges[0], yedges[-1], xedges[-1], xedges[0]]
plt.imshow(H, extent=extent,interpolation='nearest')
plt.colorbar()
plt.show()
I'm pretty sure this works, but you don't give data, so it's hard to check. normed=True gives you densities, if you don't pass normed=True, you get weighted sample counts, so if you divide your weighted version (which is really just #successes per bin) by unweighted (# of elements in each bin), you'll end up with % successes.
import matplotlib.pyplot as plt
import numpy as np
plt.figure(1)
H, xedges, yedges = np.histogram2d(x, y, bins=50, weights=w)
H2, _, _ = np.histogram2d(x,y, bins=50)
extent = [0,1, xedges[-1], xedges[0]]
plt.imshow(H/H2, extent=extent,interpolation='nearest')
plt.colorbar()
plt.show()
This could leave nan in the final histogram, so you might want to make a decision for those cases.

Categories

Resources