Related
I have a continuous input function which I would like to discretize into lets say 5-10 discrete bins between 1 and 0. Right now I am using np.digitize and rescale the output bins to 0-1. Now the problem is that sometime datasets (blue line) yield results like this:
I tried pushing up the number of discretization bins but I ended up keeping the same noise and getting just more increments. As an example where the algorithm worked with the same settings but another dataset:
this is the code I used there NumOfDisc = number of bins
intervals = np.linspace(0,1,NumOfDisc)
discretized_Array = np.digitize(Continuous_Array, intervals)
The red ilne in the graph is not important. The continuous blue line is the on I try to discretize and the green line is the discretized result.The Graphs are created with matplotlyib.pyplot using the following code:
def CheckPlots(discretized_Array, Continuous_Array, Temperature, time, PlotName)
logging.info("Plotting...")
#Setting Axis properties and titles
fig, ax = plt.subplots(1, 1)
ax.set_title(PlotName)
ax.set_ylabel('Temperature [°C]')
ax.set_ylim(40, 110)
ax.set_xlabel('Time [s]')
ax.grid(b=True, which="both")
ax2=ax.twinx()
ax2.set_ylabel('DC Power [%]')
ax2.set_ylim(-1.5,3.5)
#Plotting stuff
ax.plot(time, Temperature, label= "Input Temperature", color = '#c70e04')
ax2.plot(time, Continuous_Array, label= "Continuous Power", color = '#040ec7')
ax2.plot(time, discretized_Array, label= "Discrete Power", color = '#539600')
fig.legend(loc = "upper left", bbox_to_anchor=(0,1), bbox_transform=ax.transAxes)
logging.info("Done!")
logging.info("---")
return
Any Ideas what I could do to get sensible discretizations like in the second case?
The following solution gives the exact result you need.
Basically, the algorithm finds an ideal line, and attempts to replicate it as well as it can with less datapoints. It starts with 2 points at the edges (straight line), then adds one in the center, then checks which side has the greatest error, and adds a point in the center of that, and so on, until it reaches the desired bin count. Simple :)
import warnings
warnings.simplefilter('ignore', np.RankWarning)
def line_error(x0, y0, x1, y1, ideal_line, integral_points=100):
"""Assume a straight line between (x0,y0)->(x1,p1). Then sample the perfect line multiple times and compute the distance."""
straight_line = np.poly1d(np.polyfit([x0, x1], [y0, y1], 1))
xs = np.linspace(x0, x1, num=integral_points)
ys = straight_line(xs)
perfect_ys = ideal_line(xs)
err = np.abs(ys - perfect_ys).sum() / integral_points * (x1 - x0) # Remove (x1 - x0) to only look at avg errors
return err
def discretize_bisect(xs, ys, bin_count):
"""Returns xs and ys of discrete points"""
# For a large number of datapoints, without loss of generality you can treat xs and ys as bin edges
# If it gives bad results, you can edges in many ways, e.g. with np.polyline or np.histogram_bin_edges
ideal_line = np.poly1d(np.polyfit(xs, ys, 50))
new_xs = [xs[0], xs[-1]]
new_ys = [ys[0], ys[-1]]
while len(new_xs) < bin_count:
errors = []
for i in range(len(new_xs)-1):
err = line_error(new_xs[i], new_ys[i], new_xs[i+1], new_ys[i+1], ideal_line)
errors.append(err)
max_segment_id = np.argmax(errors)
new_x = (new_xs[max_segment_id] + new_xs[max_segment_id+1]) / 2
new_y = ideal_line(new_x)
new_xs.insert(max_segment_id+1, new_x)
new_ys.insert(max_segment_id+1, new_y)
return new_xs, new_ys
BIN_COUNT = 25
new_xs, new_ys = discretize_bisect(xs, ys, BIN_COUNT)
plot_graph(xs, ys, new_xs, new_ys, f"Discretized and Continuous comparison, N(cont) = {N_MOCK}, N(disc) = {BIN_COUNT}")
print("Bin count:", len(new_xs))
Moreover, here's my simplified plotting function I tested with.
def plot_graph(cont_time, cont_array, disc_time, disc_array, plot_name):
"""A simplified version of the provided plotting function"""
# Setting Axis properties and titles
fig, ax = plt.subplots(figsize=(20, 4))
ax.set_title(plot_name)
ax.set_xlabel('Time [s]')
ax.set_ylabel('DC Power [%]')
# Plotting stuff
ax.plot(cont_time, cont_array, label="Continuous Power", color='#0000ff')
ax.plot(disc_time, disc_array, label="Discrete Power", color='#00ff00')
fig.legend(loc="upper left", bbox_to_anchor=(0,1), bbox_transform=ax.transAxes)
Lastly, here's the Google Colab
If what I described in the comments is the problem, there are a few options to deal with this:
Do nothing: Depending on the reason you're discretizing, you might want the discrete values to reflect the continuous values accurately
Change the bins: you could shift the bins or change the number of bins, such that relatively 'flat' parts of the blue line stay within one bin, thus giving a flat green line in these parts as well, which would be visually more pleasing like in your second plot.
I have data with lots of x values around zero and only a few as you go up to around 950,
I want to create a plot with a non-linear x axis so that the relationship can be seen in a 'straight line' form. Like seen in this example,
I have tried using plt.xscale('log') but it does not achieve what I want.
I have not been able to use the log scale function with a scatter plot as it then only shows 3 values rather than the thousands that exist.
I have tried to work around it using
plt.plot(retper, aep_NW[y], marker='o', linewidth=0)
to replicate the scatter function which plots but does not show what I want.
plt.figure(1)
plt.scatter(rp,aep,label="SSI sum")
plt.show()
Image 3:
plt.figure(3)
plt.scatter(rp, aep)
plt.xscale('log')
plt.show()
Image 4:
plt.figure(4)
plt.plot(rp, aep, marker='o', linewidth=0)
plt.xscale('log')
plt.show()
ADDITION:
Hi thank you for the response.
I think you are right that my x axis is truncated but I'm not sure why or how...
I'm not really sure what to post code wise as the data is all large and coming from a server so can't really give you the data to see it with.
Basically aep_NW is a one dimensional array with 951 elements, values from 0-~140, with most values being small and only a few larger values. The data represents a storm severity index for 951 years.
Then I want the x axis to be the return period for these values, so basically I made a rp array, of the same size, which is given values from 951 down decreasing my a half each time.
I then sort the aep_NW values from lowest to highest with the highest value being associated with the largest return value (951), then the second highest aep_NW value associated with the second largest return period value (475.5) ect.
So then when I plot it I need the x axis scale to be similar to the example you showed above or the first image I attatched originally.
rp = [0]*numseas.shape[0]
i = numseas.shape[0] - 1
rp[i] = numseas.shape[0]
i = i - 1
while i != 0:
rp[i] = rp[i+1]/2
i = i - 1
y = np.argsort(aep_NW)
fig, ax = plt.subplots()
ax.scatter(rp,aep_NW[y],label="SSI sum")
ax.set_xscale('log')
ax.set_xlabel("Return period")
ax.set_ylabel("SSI score")
plt.title("AEP for NW Europe: total loss per entire extended winter season")
plt.show()
It looks like in your "Image 3" the x axis is truncated, so that you don't see the data you are interested in. It appears this is due to there being 0's in your 'rp' array. I updated the examples to show the error you are seeing, one way to exclude the zeros, and one way to clip them and show them on a different scale.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
n = 100
numseas = np.logspace(-5, 3, n)
aep_NW = np.linspace(0, 140, n)
rp = [0]*numseas.shape[0]
i = numseas.shape[0] - 1
rp[i] = numseas.shape[0]
i = i - 1
while i != 0:
rp[i] = rp[i+1] /2
i = i - 1
y = np.argsort(aep_NW)
fig, axes = plt.subplots(1, 3, figsize=(14, 5))
ax = axes[0]
ax.scatter(rp, aep_NW[y], label="SSI sum")
ax.set_xscale('log')
ax.set_xlabel("Return period")
ax.set_ylabel("SSI score")
ax = axes[1]
rp = np.array(rp)[y]
mask = rp > 0
ax.scatter(rp[mask], aep_NW[y][mask], label="SSI sum")
ax.set_xscale('log')
ax.set_xlabel("Return period (0 values excluded)")
ax = axes[2]
log2_clipped_rp = np.log2(rp.clip(2**-100, None))[y]
ax.scatter(log2_clipped_rp, aep_NW[y], label="SSI sum")
xticks = list(range(-110, 11, 20))
xticklabels = [f'$2^{{{i}}}$' for i in xticks]
ax.set_xticks(xticks)
ax.set_xticklabels(xticklabels)
ax.set_xlabel("log$_2$ Return period (values clipped to 2$^{-100}$)")
plt.show()
I want to plot two different functions in the same figure. However I want them to use different scales on their x-axis.
One scale shoudl just show the values of x and the others will have to show seconds in the end.
Right now I have this
k=5
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.set_xlabel(r"values of x") #adds description to scale on bottom
ax2 = ax1.twiny() #adds the seconds scale on top
x = np.arange(0.1, 1.5, 0.1) #values of x for function are in range
y = k*(np.power(x,(k-1))) * np.exp(-(np.power(x,(k-1)))) #that is the function I want to draw
ax1.plot(x,y) #draw function
tx = x
ty = x*7
ax2.plot(x,x*7)
ax2.set_xlabel(r"time in seconds")
ax2.set_xlim(1484) #set limit of time
ax2.invert_xaxis() #invert it so that it works like we want to
ax1.set_xlim(0.1,1.4) #set limit for the x axis so that it doesn't skale on its own.
plt.show()
I am sorry but I could not properly insert the code.
The ax2 function is right now just a dummy. I just want to be able to see it and also in the end change the scale of the ax2 to my time frame.
Any help would be greatly appreciated!
I am not sure your code doesn't work :-p
Your dummy function for ax2 is not good enough, I replaced it with ax2.plot(x*1000,x*50) to be able to see it.
And I do the plotting after the rescaling :
k=5
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.set_xlabel(r"values of x") #adds description to scale on bottom
ax2 = ax1.twiny() #adds the seconds scale on top
x = np.arange(0.1, 1.5, 0.1) #values of x for function are in range
y = k*(np.power(x,(k-1))) * np.exp(-(np.power(x,(k-1)))) #that is the function I want to draw
ax1.plot(x,y) #draw function
tx = x
ty = x*7
ax2.set_xlabel(r"time in seconds")
ax2.set_xlim(1484) #set limit of time
ax2.invert_xaxis() #invert it so that it works like we want to
ax2.plot(x*1000,x*50)
ax1.set_xlim(0.1,1.4) #set limit for the x axis so that it doesn't skale on its own.
plt.show()
Which gives :
The second plot is hidden behind the left Y axis. You will be able to see it if you use a thicker line and/or markers:
ax2.plot(x,x*7, '-o', lw=5)
You could as well change the x limits of ax2 but you went out of your way to make it as it is so I guess it is exactly as you want it to be.
I have a small issue with matplotlib.pyplot and I hope someone might have come across it before.
I have data that contain X,Y,e values that are the X, Y measurements of a variable and e are the errors of the measurements in Y. I need to plot them in a log log scale.
I use the plt.errorbars function to plot them and then set yscale and xscale to log and this works fine. But I need to also plot a line on the same graph that needs to be in linear scale.
I am able to have the plots done separately just fine but I would like to have them in the same image if possible. Do you have any ideas? I am posting what I have done for now.
Cheers,
Kimon
tdlist = np.array([0.01,0.02,0.05,0.1,0.2,0.3,0.4,0.5,0.8,1,2,5,10,15,20,25,30,40,60,80,100,150,200,250,300,400])
freqlist=np.array([30,40,50,60,70,80,90,100,110,120,140,160,180,200,220,250,300,350,400,450])
filename=opts.filename
data = reader(filename)
data2 = logconv(data)
#x,y,e the data. Calculating usefull sums
x = data2[0]
y = data2[1]
e = data2[2]
xoe2 = np.sum(x/e**2)
yoe2 = np.sum(y/e**2)
xyoe2 = np.sum(x*y/e**2)
oe2 = np.sum(1/e**2)
x2oe2 = np.sum(x**2/e**2)
aslope = (xoe2*yoe2-xyoe2*oe2)/(xoe2**2-x2oe2*oe2)
binter = (xyoe2-aslope*x2oe2)/xoe2
aerr = np.sqrt(oe2/(x2oe2*oe2-xoe2**2))
berr = np.sqrt(x2oe2/(x2oe2*oe2-xoe2**2))
print('slope is ',aslope,' +- ', aerr)
print('inter is ',binter,' +- ', berr)
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)
ax2 = fig.add_axes(ax1.get_position(), frameon=False)
ax1.errorbar(data[0],data[1],yerr=data[2],fmt='o')
ax1.set_xscale('log',basex=10)
ax1.set_yscale('log',basey=10)
ax1.set_yticks([])
ax1.set_xticks([])
ax2.plot(x,aslope*x+binter,'r')
ax2.plot(x,(aslope-aerr)*x+(binter+berr),'--')
ax2.plot(x,(aslope+aerr)*x+(binter-berr),'--')
ax2.set_xscale('linear')
ax2.set_yscale('linear')
plt.xticks(np.log10(freqlist),freqlist.astype('int'))
plt.yticks(np.log10(tdlist),tdlist.astype('float'))
plt.xlabel('Frequency (MHz)')
plt.ylabel('t_s (msec)')
fitndx1 = 'Fit slope '+"{0:.2f}".format(aslope)+u"\u00B1"+"{0:.2f}".format(aerr)
plt.legend(('Data',fitndx1))
plt.show()
Following Molly's suggestion I managed to get closer to my goal but still not there. I am adding a bit more info for what I am trying to do and it might clarify things a bit.
I am setting ax1 to the errobar plot that uses loglog scale. I need to use errorbar and not loglog plot so that I can display the errors with my points.
I am using ax2 to plot the linear fit in linealinear scale.
Moreover I do not want the x and y axes to display values that are 10,100,1000 powers of ten but my own axes labels that have the spacing I want therefore I am using the plt.xticks. I tried ax1.set_yticks and ax1.set_yticklabes but with no success. Below is the image I am getting.
I do not have enough reputation to post an image but here is the link of it uploaded
http://postimg.org/image/uojanigab/
The values of my points should be x range = 40 - 80 and y range = 5 -200 as the fit lines are now.
You can create two overlapping axes using the add_suplot method of figure. Here's an example:
from matplotlib import pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)
ax2 = fig.add_axes(ax1.get_position(), frameon=False)
ax1.loglog([1,10,100,1000],[1000,1,100,10])
ax2.plot([5,10,11,13],'r')
plt.show()
You can then turn off the x and y ticks for the linear scale plot like this:
ax2.set_xticks([])
ax2.set_yticks([])
I was not able to get two sets of axis working with the errorbar function so I had to convert everything to log scale including my linear plot. Below is the code I use to get it might be useful to someone.
plt.errorbar(data[0],data[1],yerr=data[2],fmt='o')
plt.xscale('log',basex=10)
plt.yscale('log',basey=10)
plt.plot(data[0],data[0]**aslope*10**binter,'r')
plt.plot(data[0],data[0]**(aslope-aerr)*10**(binter+berr),'--')
plt.plot(data[0],data[0]**(aslope+aerr)*10**(binter-berr),'--')
plt.xticks(freqlist,freqlist.astype('int'))
plt.yticks(tdlist,tdlist.astype('float'))
plt.xlabel('Frequency (MHz)')
plt.ylabel('t_s (msec)')
fitndx1 = 'Fit slope '+"{0:.2f}".format(aslope)+u"\u00B1"+"{0:.2f}".format(aerr)
plt.legend(('Data',fitndx1))
plt.show()
And here is the link to the final image
http://postimg.org/image/bevj2k6nf/
I am trying to reproduce the work from http://jheusser.github.io/2013/09/08/hawkes.html in python except with different data. I have written code to simulate a Poisson process as well as the Hawkes process they describe.
To do the Hawkes model MLE I define the log likelihood function as
def loglikelihood(params, data):
(mu, alpha, beta) = params
tlist = np.array(data)
r = np.zeros(len(tlist))
for i in xrange(1,len(tlist)):
r[i] = math.exp(-beta*(tlist[i]-tlist[i-1]))*(1+r[i-1])
loglik = -tlist[-1]*mu
loglik = loglik+alpha/beta*sum(np.exp(-beta*(tlist[-1]-tlist))-1)
loglik = loglik+np.sum(np.log(mu+alpha*r))
return -loglik
Using some dummy data, we can compute the MLE for the Hawkes process with
atimes=[58.98353497, 59.28420225, 59.71571013, 60.06750179, 61.24794134,
61.70692463, 61.73611983, 62.28593814, 62.51691723, 63.17370423
,63.20125152, 65.34092403, 214.24934446, 217.0390236, 312.18830525,
319.38385604, 320.31758188, 323.50201334, 323.76801537, 323.9417007]
res = minimize(loglikelihood, (0.01, 0.1,0.1),method='Nelder-Mead',args = (atimes,))
print res
However, I don't know how to do the following things in python.
How can I do the equivalent of evalCIF to get a similar fitted versus empirical intensities plot as they have?
How can I compute the residuals for the Hawkes model to make the equivalent of the QQ plot they have. They say they use an R package called ptproc but I can't find a python equivalent.
OK, so first thing that you may wish to do is to plot the data. To keep it simple I've reproduced this figure as it only has 8 events occurring so it's easy to see the behaviour of the system. The following code:
import numpy as np
import math, matplotlib
import matplotlib.pyplot
import matplotlib.lines
mu = 0.1 # Parameter values as found in the article http://jheusser.github.io/2013/09/08/hawkes.html Hawkes Process section.
alpha = 1.0
beta = 0.5
EventTimes = np.array([0.7, 1.2, 2.0, 3.8, 7.1, 8.2, 8.9, 9.0])
" Compute conditional intensities for all times using the Hawkes process. "
timesOfInterest = np.linspace(0.0, 10.0, 100) # Times where the intensity will be sampled.
conditionalIntensities = [] # Conditional intensity for every epoch of interest.
for t in timesOfInterest:
conditionalIntensities.append( mu + np.array( [alpha*math.exp(-beta*(t-ti)) if t > ti else 0.0 for ti in EventTimes] ).sum() ) # Find the contributions of all preceding events to the overall chance of another one occurring. All events that occur after t have no contribution.
" Plot the conditional intensity time history. "
fig = matplotlib.pyplot.figure()
ax = fig.gca()
labelsFontSize = 16
ticksFontSize = 14
fig.suptitle(r"$Conditional\ intensity\ VS\ time$", fontsize=20)
ax.grid(True)
ax.set_xlabel(r'$Time$',fontsize=labelsFontSize)
ax.set_ylabel(r'$\lambda$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize)
matplotlib.rc('ytick', labelsize=ticksFontSize)
eventsScatter = ax.scatter(EventTimes,np.ones(len(EventTimes))) # Just to indicate where the events took place.
ax.plot(timesOfInterest, conditionalIntensities, color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
fittedPlot = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
fig.legend([fittedPlot, eventsScatter], [r'$Conditional\ intensity\ computed\ from\ events$', r'$Events$'])
matplotlib.pyplot.show()
reproduces the figure pretty accurately, even though I've chosen the event epochs somewhat arbitrarily:
This can also be applied to the set of example set of data of 5000 trades by binning the data and treating every bin as an event. However, what happens now, every event has a slightly different weight as different number of trades occurs in every bin.
This is also mentioned in the article in Fitting Bitcoin Trade Arrival to a Hawkes Process section with a proposed way to overcome this problem: The only difference to the original dataset is that I added a random millisecond timestamp to all trades that share a timestamp with another trade. This is required as the model requires to distinguish every trade (i.e. every trade must have a unique timestamp). This is incorporated in the following code:
import numpy as np
import math, matplotlib, pandas
import scipy.optimize
import matplotlib.pyplot
import matplotlib.lines
" Read example trades' data. "
all_trades = pandas.read_csv('all_trades.csv', parse_dates=[0], index_col=0) # All trades' data.
all_counts = pandas.DataFrame({'counts': np.ones(len(all_trades))}, index=all_trades.index) # Only the count of the trades is really important.
empirical_1min = all_counts.resample('1min', how='sum') # Bin the data so find the number of trades in 1 minute intervals.
baseEventTimes = np.array( range(len(empirical_1min.values)), dtype=np.float64) # Dummy times when the events take place, don't care too much about actual epochs where the bins are placed - this could be scaled to days since epoch, second since epoch and any other measure of time.
eventTimes = [] # With the event batches split into separate events.
for i in range(len(empirical_1min.values)): # Deal with many events occurring at the same time - need to distinguish between them by splitting each batch of events into distinct events taking place at almost the same time.
if not np.isnan(empirical_1min.values[i]):
for j in range(empirical_1min.values[i]):
eventTimes.append(baseEventTimes[i]+0.000001*(j+1)) # For every event that occurrs at this epoch enter a dummy event very close to it in time that will increase the conditional intensity.
eventTimes = np.array( eventTimes, dtype=np.float64 ) # Change to array for ease of operations.
" Find a fit for alpha, beta, and mu that minimises loglikelihood for the input data. "
#res = scipy.optimize.minimize(loglikelihood, (0.01, 0.1,0.1), method='Nelder-Mead', args = (eventTimes,))
#(mu, alpha, beta) = res.x
mu = 0.07 # Parameter values as found in the article.
alpha = 1.18
beta = 1.79
" Compute conditional intensities for all epochs using the Hawkes process - add more points to see how the effect of individual events decays over time. "
conditionalIntensitiesPlotting = [] # Conditional intensity for every epoch of interest.
timesOfInterest = np.linspace(eventTimes.min(), eventTimes.max(), eventTimes.size*10) # Times where the intensity will be sampled. Sample at much higher frequency than the events occur at.
for t in timesOfInterest:
conditionalIntensitiesPlotting.append( mu + np.array( [alpha*math.exp(-beta*(t-ti)) if t > ti else 0.0 for ti in eventTimes] ).sum() ) # Find the contributions of all preceding events to the overall chance of another one occurring. All events that occur after time of interest t have no contribution.
" Compute conditional intensities at the same epochs as the empirical data are known. "
conditionalIntensities=[] # This will be used in the QQ plot later, has to have the same size as the empirical data.
for t in np.linspace(eventTimes.min(), eventTimes.max(), eventTimes.size):
conditionalIntensities.append( mu + np.array( [alpha*math.exp(-beta*(t-ti)) if t > ti else 0.0 for ti in eventTimes] ).sum() ) # Use eventTimes here as well to feel the influence of all the events that happen at the same time.
" Plot the empirical and fitted datasets. "
fig = matplotlib.pyplot.figure()
ax = fig.gca()
labelsFontSize = 16
ticksFontSize = 14
fig.suptitle(r"$Conditional\ intensity\ VS\ time$", fontsize=20)
ax.grid(True)
ax.set_xlabel(r'$Time$',fontsize=labelsFontSize)
ax.set_ylabel(r'$\lambda$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize)
matplotlib.rc('ytick', labelsize=ticksFontSize)
# Plot the empirical binned data.
ax.plot(baseEventTimes,empirical_1min.values, color='blue', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
empiricalPlot = matplotlib.lines.Line2D([],[],color='blue', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
# And the fit obtained using the Hawkes function.
ax.plot(timesOfInterest, conditionalIntensitiesPlotting, color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
fittedPlot = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
fig.legend([fittedPlot, empiricalPlot], [r'$Fitted\ data$', r'$Empirical\ data$'])
matplotlib.pyplot.show()
This generates the following fit to the plot:
All looking good but, when you look at the detail, you'll see that computing the residuals by simply taking one vector of the number of trades and subtracting the fitted one won't do since they have different lengths:
It is possible, however, to extract the intensity at the same epochs as when it was recorded for the empirical data and then compute the residuals. This enables you to find quantiles of both empirical and fitted data and plot them against each other thus generating the QQ plot:
""" GENERATE THE QQ PLOT. """
" Process the data and compute the quantiles. "
orderStatistics=[]; orderStatistics2=[];
for i in range( empirical_1min.values.size ): # Make sure all the NANs are filtered out and both arrays have the same size.
if not np.isnan( empirical_1min.values[i] ):
orderStatistics.append(empirical_1min.values[i])
orderStatistics2.append(conditionalIntensities[i])
orderStatistics = np.array(orderStatistics); orderStatistics2 = np.array(orderStatistics2);
orderStatistics.sort(axis=0) # Need to sort data in ascending order to make a QQ plot. orderStatistics is a column vector.
orderStatistics2.sort()
smapleQuantiles=np.zeros( orderStatistics.size ) # Quantiles of the empirical data.
smapleQuantiles2=np.zeros( orderStatistics2.size ) # Quantiles of the data fitted using the Hawkes process.
for i in range( orderStatistics.size ):
temp = int( 100*(i-0.5)/float(smapleQuantiles.size) ) # (i-0.5)/float(smapleQuantiles.size) th quantile. COnvert to % as expected by the numpy function.
if temp<0.0:
temp=0.0 # Avoid having -ve percentiles.
smapleQuantiles[i] = np.percentile(orderStatistics, temp)
smapleQuantiles2[i] = np.percentile(orderStatistics2, temp)
" Make the quantile plot of empirical data first. "
fig2 = matplotlib.pyplot.figure()
ax2 = fig2.gca(aspect="equal")
fig2.suptitle(r"$Quantile\ plot$", fontsize=20)
ax2.grid(True)
ax2.set_xlabel(r'$Sample\ fraction\ (\%)$',fontsize=labelsFontSize)
ax2.set_ylabel(r'$Observations$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize)
matplotlib.rc('ytick', labelsize=ticksFontSize)
distScatter = ax2.scatter(smapleQuantiles, orderStatistics, c='blue', marker='o') # If these are close to the straight line with slope line these points come from a normal distribution.
ax2.plot(smapleQuantiles, smapleQuantiles, color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)
normalDistPlot = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)
fig2.legend([normalDistPlot, distScatter], [r'$Normal\ distribution$', r'$Empirical\ data$'])
matplotlib.pyplot.show()
" Make a QQ plot. "
fig3 = matplotlib.pyplot.figure()
ax3 = fig3.gca(aspect="equal")
fig3.suptitle(r"$Quantile\ -\ Quantile\ plot$", fontsize=20)
ax3.grid(True)
ax3.set_xlabel(r'$Empirical\ data$',fontsize=labelsFontSize)
ax3.set_ylabel(r'$Data\ fitted\ with\ Hawkes\ distribution$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize)
matplotlib.rc('ytick', labelsize=ticksFontSize)
distributionScatter = ax3.scatter(smapleQuantiles, smapleQuantiles2, c='blue', marker='x') # If these are close to the straight line with slope line these points come from a normal distribution.
ax3.plot(smapleQuantiles, smapleQuantiles, color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)
normalDistPlot2 = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)
fig3.legend([normalDistPlot2, distributionScatter], [r'$Normal\ distribution$', r'$Comparison\ of\ datasets$'])
matplotlib.pyplot.show()
This generates the following plots:
The quantile plot of empirical data isn't exactly the same as in the article, I'm not sure why as I'm not great with statistics. But, from programming standpoint, this is how you can go about all this.