How can I plot ca. 20 million points as a scatterplot?

I am trying to create a scatterplot with matplotlib that consists of ca. 20 million data points. Even with alpha set to the lowest value that still leaves any data visible, the result is just a completely black plot.
plt.scatter(timedPlotData, plotData, alpha=0.01, marker='.')
The x-axis is a continuous timeline of about 2 months and the y-axis consists of 150k consecutive integer values.
Is there any way to plot all the points so that their distribution over time is still visible?
Thank you for your help.

There's more than one way to do this. A lot of folks have suggested a heatmap/kernel-density-estimate/2d-histogram. @Bucky suggested using a moving average. In addition, you can fill between a moving min and moving max, and plot the moving mean over the top. I often call this a "chunkplot", but that's a terrible name. The implementation below assumes that your time (x) values are monotonically increasing. If they're not, it's simple enough to sort y by x before "chunking" in the chunkplot function.
Here are a couple of different ideas. Which is best will depend on what you want to emphasize in the plot. Note that this will be rather slow to run, but that's mostly due to the scatterplot. The other plotting styles are much faster.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt

np.random.seed(1977)

def main():
    x, y = generate_data()
    fig, axes = plt.subplots(nrows=3, sharex=True)
    for ax in axes.flat:
        ax.xaxis_date()
    fig.autofmt_xdate()

    axes[0].set_title('Scatterplot of all data')
    axes[0].scatter(x, y, marker='.')

    axes[1].set_title('"Chunk" plot of data')
    chunkplot(x, y, chunksize=1000, ax=axes[1],
              edgecolor='none', alpha=0.5, color='gray')

    axes[2].set_title('Hexbin plot of data')
    axes[2].hexbin(x, y)

    plt.show()

def generate_data():
    # Generate a very noisy but interesting timeseries
    x = mdates.drange(dt.datetime(2010, 1, 1), dt.datetime(2013, 9, 1),
                      dt.timedelta(minutes=10))
    num = x.size
    y = np.random.random(num) - 0.5
    y.cumsum(out=y)
    y += 0.5 * y.max() * np.random.random(num)
    return x, y

def chunkplot(x, y, chunksize, ax=None, line_kwargs=None, **kwargs):
    if ax is None:
        ax = plt.gca()
    if line_kwargs is None:
        line_kwargs = {}
    # Wrap the array into a 2D array of chunks, truncating the last chunk if
    # chunksize isn't an even divisor of the total size.
    # (This part won't use _any_ additional memory)
    numchunks = y.size // chunksize
    ychunks = y[:chunksize*numchunks].reshape((-1, chunksize))
    xchunks = x[:chunksize*numchunks].reshape((-1, chunksize))

    # Calculate the max, min, and means of chunksize-element chunks...
    max_env = ychunks.max(axis=1)
    min_env = ychunks.min(axis=1)
    ycenters = ychunks.mean(axis=1)
    xcenters = xchunks.mean(axis=1)

    # Now plot the bounds and the mean...
    fill = ax.fill_between(xcenters, min_env, max_env, **kwargs)
    line = ax.plot(xcenters, ycenters, **line_kwargs)[0]
    return fill, line

main()

For each day, tally up the frequency of each value (a collections.Counter will do this nicely), then plot a heatmap of the values, one per day. For publication, use a grayscale for the heatmap colors.
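A rough sketch of the heatmap idea, for illustration only. It swaps the per-day collections.Counter tally for np.histogram2d, which does the same counting on a regular grid. The helper names (density_grid, plot_density), the bin counts and the synthetic stand-in data are all assumptions, not something from the question.

```python
import numpy as np

def density_grid(x, y, x_bins=60, y_bins=200):
    """Count how many points fall into each (x, y) cell of a regular grid."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=(x_bins, y_bins))
    return counts, x_edges, y_edges

def plot_density(x, y, ax=None, **grid_kwargs):
    """Draw the counts as a grayscale heatmap with a logarithmic color scale."""
    # Imported here so density_grid can be used without matplotlib installed.
    import matplotlib.pyplot as plt
    from matplotlib.colors import LogNorm

    counts, x_edges, y_edges = density_grid(x, y, **grid_kwargs)
    if ax is None:
        ax = plt.gca()
    # Mask empty cells so the log color scale never sees a zero.
    masked = np.ma.masked_equal(counts.T, 0)
    mesh = ax.pcolormesh(x_edges, y_edges, masked, norm=LogNorm(), cmap="gray_r")
    ax.figure.colorbar(mesh, ax=ax, label="points per cell")
    return mesh

# Synthetic stand-in for the real data: ~2 months on x, 150k integer values on y.
rng = np.random.default_rng(0)
x = rng.uniform(0, 60, size=200_000)
y = rng.integers(0, 150_000, size=200_000)
counts, x_edges, y_edges = density_grid(x, y)
```

Calling plot_density(x, y) followed by plt.show() draws the figure. With one x bin per day this is roughly the "one column per day" heatmap described above.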

My recommendation would be to use a sorting and moving average algorithm on the raw data before you plot it. This should leave the mean and trend intact over the time period of interest while providing you with a reduction in clutter on the plot.
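A minimal sketch of the sort-then-smooth idea, assuming x and y are 1-D NumPy arrays of equal length. The function name moving_average and the window size are made up for illustration, and a plain boxcar average stands in for whatever smoothing you actually prefer.

```python
import numpy as np

def moving_average(x, y, window):
    """Sort the points by x, then average both x and y over a sliding window."""
    order = np.argsort(x, kind="stable")
    x_sorted, y_sorted = x[order], y[order]
    kernel = np.ones(window) / window
    # mode="valid" keeps only windows that lie fully inside the data,
    # so the result has len(x) - window + 1 points.
    x_smooth = np.convolve(x_sorted, kernel, mode="valid")
    y_smooth = np.convolve(y_sorted, kernel, mode="valid")
    return x_smooth, y_smooth

# Synthetic stand-in data: unsorted times and a noisy drifting signal.
rng = np.random.default_rng(0)
x = rng.uniform(0, 60, size=100_000)
y = 1000 * x + rng.normal(scale=5000, size=x.size)
x_smooth, y_smooth = moving_average(x, y, window=1000)
```

The smoothed pair can then go straight into plt.plot(x_smooth, y_smooth), which is much cheaper to draw than the raw scatter.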

Group values into bands on each day and use a 3d histogram of count, value band, day.
That way you can get the number of occurrences in a given band on each day clearly.

Related

Python: scatter plot with non-linear x axis

I have data with lots of x values around zero and only a few as you go up to around 950.
I want to create a plot with a non-linear x axis so that the relationship can be seen in a 'straight line' form, like the one in this example.
I have tried using plt.xscale('log'), but it does not achieve what I want.
I have not been able to use the log scale with a scatter plot, as it then only shows 3 values rather than the thousands that exist.
I have tried to work around it using
plt.plot(retper, aep_NW[y], marker='o', linewidth=0)
to replicate the scatter function which plots but does not show what I want.
plt.figure(1)
plt.scatter(rp,aep,label="SSI sum")
plt.show()
Image 3:
plt.figure(3)
plt.scatter(rp, aep)
plt.xscale('log')
plt.show()
Image 4:
plt.figure(4)
plt.plot(rp, aep, marker='o', linewidth=0)
plt.xscale('log')
plt.show()
ADDITION:
Hi thank you for the response.
I think you are right that my x axis is truncated but I'm not sure why or how...
I'm not really sure what to post code wise as the data is all large and coming from a server so can't really give you the data to see it with.
Basically aep_NW is a one dimensional array with 951 elements, values from 0-~140, with most values being small and only a few larger values. The data represents a storm severity index for 951 years.
Then I want the x axis to be the return period for these values, so basically I made an rp array of the same size, which is given values from 951 down, decreasing by a half each time.
I then sort the aep_NW values from lowest to highest, with the highest value being associated with the largest return period (951), the second highest aep_NW value with the second largest return period (475.5), etc.
So when I plot it, I need the x axis scale to be similar to the example you showed above or the first image I originally attached.
rp = [0]*numseas.shape[0]
i = numseas.shape[0] - 1
rp[i] = numseas.shape[0]
i = i - 1
while i != 0:
    rp[i] = rp[i+1]/2
    i = i - 1
y = np.argsort(aep_NW)
fig, ax = plt.subplots()
ax.scatter(rp,aep_NW[y],label="SSI sum")
ax.set_xscale('log')
ax.set_xlabel("Return period")
ax.set_ylabel("SSI score")
plt.title("AEP for NW Europe: total loss per entire extended winter season")
plt.show()

It looks like the x axis in your "Image 3" is truncated, so that you don't see the data you are interested in. This appears to be due to the 0's in your 'rp' array, which a log scale cannot display. I updated the examples to show the error you are seeing, one way to exclude the zeros, and one way to clip them and show them on a different scale.
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline  (Jupyter-only magic; uncomment when running in a notebook)
n = 100
numseas = np.logspace(-5, 3, n)
aep_NW = np.linspace(0, 140, n)
rp = [0]*numseas.shape[0]
i = numseas.shape[0] - 1
rp[i] = numseas.shape[0]
i = i - 1
while i != 0:
    rp[i] = rp[i+1] / 2
    i = i - 1
y = np.argsort(aep_NW)
fig, axes = plt.subplots(1, 3, figsize=(14, 5))
ax = axes[0]
ax.scatter(rp, aep_NW[y], label="SSI sum")
ax.set_xscale('log')
ax.set_xlabel("Return period")
ax.set_ylabel("SSI score")
ax = axes[1]
rp = np.array(rp)[y]
mask = rp > 0
ax.scatter(rp[mask], aep_NW[y][mask], label="SSI sum")
ax.set_xscale('log')
ax.set_xlabel("Return period (0 values excluded)")
ax = axes[2]
log2_clipped_rp = np.log2(rp.clip(2**-100, None))[y]
ax.scatter(log2_clipped_rp, aep_NW[y], label="SSI sum")
xticks = list(range(-110, 11, 20))
xticklabels = [f'$2^{{{i}}}$' for i in xticks]
ax.set_xticks(xticks)
ax.set_xticklabels(xticklabels)
ax.set_xlabel("log$_2$ Return period (values clipped to 2$^{-100}$)")
plt.show()

Adding histogram bins together and plotting a figure

I have a histogram with 8192 bins, each bin imported from a line of a text file. To cut things short, it makes an awful fit, and it was suggested to me that I could reduce the statistical errors by adding counts from adjacent bins, e.g. add bins 0-7 to make a new first bin, 8 times as wide but (roughly) 8 times as high.
Ideally, I would like to output a histogram whose bin width is controlled by a single constant in the code. However, my attempts to generalise the code to any bin width do not produce something like the first image below (from the version of my code that can only do a bin width of 1). Instead they produce something like the second image below, which is missing the fit lines and has a second, empty graph in the same image file.
The following is a histogram plotted directly from the original data, i.e. binwidth = 1:
Original code output (only works for binwidth 1, though).
Example of trying bin width 8 with some code modifications.
I also need it to return a fit report and the area under the Gaussian, as this is plotted later on in the code in an exponential decay curve.
Here is the section of code I think is relevant:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from numpy import exp, loadtxt, pi, sqrt, random, linspace
from lmfit import Model
import glob, os
## Load text file
x = np.linspace(0, 8191, 8192)
finalprefix = str(n).zfill(3)
fullprefix = folderToAnalyze + prefix + finalprefix
y = loadtxt(fullprefix + ".Spe", skiprows= 12, max_rows = 8192)
## Make figure and label
fig, ax = plt.subplots(figsize=(15,8))
fig.suptitle('Photon coincidence detections from $β^+$ + $β^-$ annihilation', fontsize=18)
plt.xlabel('Bins', fontsize=14)
plt.ylabel('Counts', fontsize=14)
## Plot data
ax.bar(x, y)
ax.set_xlim(600,960)
## Adding Bins Together
y = y.astype(int)
x = x.astype(int)
## create the data
data = np.repeat(x, y)
## determine the range of x
x_range = range(min(data), max(data)+1)
## determine the length of x
x_len = len(x_range)
## plot
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(10, 10))
ax1.hist(data, bins=x_len) # outliers are not plotted
plt.show()
## given x_len determine how many bins for a given bin width
width = 8
bins = int(np.round(x_len / width))
## determine new x and y for the histogram
y, x = np.histogram(data, bins=bins)
## Fit data to Gaussian
gmodel = Model(gaussian)
result = gmodel.fit(y, x=x[:-1], amp=8, cen=approxcen, wid=1)
## result
print(result.fit_report())
fig.savefig("abw_" + finalprefix + ".png")
## Append to list if error in amplitude and amplitude itself is within reasonable bounds
if result.params['amp'].stderr < stderrThreshold and result.params['amp'] > minimumAmplitude:
    amps.append(result.params['amp'].value)
    ampserr.append(result.params['amp'].stderr)
    ts.append(MaestroT*n)
## Plot decay curve
fig, ax = plt.subplots()
ax.errorbar(ts, amps, yerr= 2*np.array(ampserr), fmt="ko-", capsize = 5, capthick= 2, elinewidth=3, markersize=5)
plt.xlabel('Time', fontsize=14)
plt.ylabel('Peak amplitude', fontsize=14)
plt.title("Decay curve of P-31 by $β^+$ emission", fontsize=14)
Some synthetic data: {1,2,1,0,0,0,0,0,6,0,0,0,0,0,0,0,7,0,0,1,0,1,0,0,6,6,0,0,0,3,0,0,3,3,3,5,4,0,4,3,1,4,0,5,6,4,0,2,0,0,0,9,6,1,1,1,0,0,3,2,2,3,0,0,0,2,4,0,0,0,0,0,0,4,10,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0}
I think this should create two very differently shaped histograms when the bin width is 1 and when it is 8. I have just made the numbers up, though, so the fit may not be good. It is also worth mentioning that one of the problems I was having is related to adding together the information read in from the text file.
In case it's useful:
-Here is the full original code
-Here is the data for that histogram
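One possible way to add adjacent bins together, sketched under the assumption that the counts are already a 1-D array with one entry per original bin (as y is after loadtxt). The function name rebin_counts is hypothetical. It sums each group of `width` neighbouring bins directly, so there is no need to expand the counts with np.repeat and histogram them again.

```python
import numpy as np

def rebin_counts(counts, width):
    """Sum every `width` adjacent bins; an incomplete trailing group is dropped."""
    counts = np.asarray(counts)
    n_full = (counts.size // width) * width
    summed = counts[:n_full].reshape(-1, width).sum(axis=1)
    # Left edge of each merged bin, in units of the original bin index.
    left_edges = np.arange(0, n_full, width)
    return left_edges, summed

# The synthetic data from the question.
synthetic = [1,2,1,0,0,0,0,0,6,0,0,0,0,0,0,0,7,0,0,1,0,1,0,0,6,6,0,0,0,3,0,0,
             3,3,3,5,4,0,4,3,1,4,0,5,6,4,0,2,0,0,0,9,6,1,1,1,0,0,3,2,2,3,0,0,
             0,2,4,0,0,0,0,0,0,4,10,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0]

width = 8  # the single constant that controls the bin width
left_edges, summed = rebin_counts(synthetic, width)
```

The merged bins can be drawn on the existing axes with ax.bar(left_edges, summed, width=width, align='edge'), and the fit can be run against left_edges + width / 2 as the x values, so that no second figure is created along the way.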

Better visualization of matplotlib plot

I want to include a plot in my thesis (the document will be a standard A4-page PDF) for which I have data of two time series, both continuous values expressed as percentages.
Both time series cover one year without Sundays, so about 310 data points each.
I came up with something like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
ts = day_agg_plan_temp.set_index('Date')
ts = ts['2018-01-01': '2019-01-01']
plt.figure(figsize=(20,15))
ax1 = ts.label.plot(grid=True, label='Ground Truth', marker='.')
ax2 = ts.pred.plot(grid=True, label='Prediction', marker='.')
plt.legend()
plt.show()
resulting in this:
This is not really appealing, as there is too much going on, and I want to point out the difference between the blue and orange lines at each data point.
So my question is: is there a better way to do it other than shrinking the date range? (I really don't want to do that, because this plot is already a snippet of the actual time series, which covers almost 3 years.)

Here is some code that generates data using Brownian motion, calculates a trend using a Savitzky–Golay filter (but use whatever is best for your case study), and plots it in a way that lets the user see the original data and the trend clearly at the same time.
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter

# Generating some Random Data
def brownian(x0, n, dt, delta, out=None):
    x0 = np.asarray(x0)
    # Independent normal increments with standard deviation delta * sqrt(dt)
    r = norm.rvs(size=x0.shape + (n,), scale=delta * np.sqrt(dt))
    if out is None:
        out = np.empty(r.shape)
    # The path is the running sum of the increments, shifted by the start value
    np.cumsum(r, axis=-1, out=out)
    out += np.expand_dims(x0, axis=-1)
    return out

delta = 2
T = 10.0
N = 500
dt = T/N
m = 2
x = np.empty((m,N+1))
x[:, 0] = 50
brownian(x[:,0], N, dt, delta, out=x[:,1:])
t = np.linspace(0.0, N*dt, N+1)

# Obtaining the trend using some arbitrary filter
y1 = savgol_filter(x[0], 51, 3)
y2 = savgol_filter(x[1], 51, 3)

# Plotting the raw data (transparent)
plt.plot(t, x[0], color="red", alpha=0.2)
plt.plot(t, x[1], color="blue", alpha=0.2)

# Plotting the trend data (opaque)
plt.plot(t, y1, color="red")
plt.plot(t, y2, color="blue")

# Calling the plot
plt.show()
The result is this:
My point is that by playing with the colors (or transparency) you can make some data appear as if in the background, and other data (usually the most relevant) as if in the foreground. It's a UX technique (like blurring, darkening, or making the background paler).
You can also play with the line width (or style) if the vertical variability of the data is not enough to clearly separate the sets. In your case I don't think it will be necessary.

Scale colormap for contour and contourf

I'm trying to plot the contour map of a given function f(x,y), but since the function's output scales really fast, I'm losing a lot of information for lower values of x and y. I found a suggestion on the forums to work around that using vmax, and it actually worked, but only when plotted for specific x and y limits and a specific number of colormap levels.
Say I have this plot:
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
u = np.linspace(-2,2,1000)
x,y = np.meshgrid(u,u)
z = (1-x)**2+100*(y-x**2)**2
cont = plt.contour(x,y,z,500,colors='black',linewidths=.3)
cont = plt.contourf(x,y,z,500,cmap="jet",vmax=100)
plt.colorbar(cont)
plt.show()
I want to uncover what's beyond the axis limits while keeping the same scale, but if I change the x and y limits to -3 and 3 I get:
See how I lost most of my levels, since the max value of the function within these limits is much higher. A workaround for this problem is to increase the levels to 1000, but that takes a lot of computational time.
Is there a way to plot only the contour levels that I need? That is, between 0 and 100.
An example of a desired output would be:
With the white space being the continuation of the plot without resizing the levels.
The code I'm using is the one given after the first image.

There are a few possible ideas here. The one I very much prefer is a logarithmic representation of the data. An example would be
from matplotlib import ticker
fig = plt.figure(1)
cont1 = plt.contourf(x,y,z,cmap="jet",locator=ticker.LogLocator(numticks=10))
plt.colorbar(cont1)
plt.show()
fig = plt.figure(2)
cont2 = plt.contourf(x,y,np.log10(z),100,cmap="jet")
plt.colorbar(cont2)
plt.show()
The first example uses matplotlib's LogLocator. The second one just directly computes the logarithm of the data and plots that normally.
The third example just caps all data above 100.
fig = plt.figure(3)
zcapped = z.copy()
zcapped[zcapped>100]=100
cont3 = plt.contourf(x,y,zcapped,100,cmap="jet")
cbar = plt.colorbar(cont3)
plt.show()
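Another option, sketched here as an illustration rather than a tested recipe for your data, is to pass explicit levels between 0 and 100. contourf only fills between the levels it is given (unless extend is set), so everything above 100 is left blank, which resembles the white space in the desired output. The helper names and the grid size are assumptions.

```python
import numpy as np

def rosenbrock_grid(limit=3.0, n=400):
    """Evaluate the function from the question on a square grid."""
    u = np.linspace(-limit, limit, n)
    x, y = np.meshgrid(u, u)
    z = (1 - x)**2 + 100 * (y - x**2)**2
    return x, y, z

# 50 filled bands between 0 and 100, independent of the axis limits.
levels = np.linspace(0, 100, 51)

def plot_fixed_levels(x, y, z, levels):
    """Draw filled contours and thin contour lines only at the given levels."""
    # Imported here so the grid helper can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    filled = ax.contourf(x, y, z, levels=levels, cmap="jet")
    ax.contour(x, y, z, levels=levels, colors="black", linewidths=0.3)
    fig.colorbar(filled, ax=ax)
    return fig, ax

x, y, z = rosenbrock_grid(limit=3.0)
```

Calling plot_fixed_levels(x, y, z, levels) followed by plt.show() draws the figure. Because the levels are fixed, widening the limits to -3..3 does not stretch the color scale, and the cost depends on the 51 levels rather than on the maximum of z.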

Plot histogram normalized by fixed parameter

I need to plot a normalized histogram (by normalized I mean divided by a fixed value) using the histtype='step' style.
The issue is that plt.bar() doesn't seem to support that style, and if I instead use plt.hist(), which does, I can't (or at least don't know how to) plot the normalized histogram.
Here's a MWE of what I mean:
import matplotlib.pyplot as plt
import numpy as np
def rand_data():
    return np.random.uniform(low=10., high=20., size=(200,))
# Generate data.
x1 = rand_data()
# Define histogram params.
binwidth = 0.25
x_min, x_max = x1.min(), x1.max()
bin_n = np.arange(int(x_min), int(x_max + binwidth), binwidth)
# Obtain histogram.
hist1, edges1 = np.histogram(x1, bins=bin_n)
# Normalization parameter.
param = 5.
# Plot histogram normalized by the parameter defined above.
plt.ylim(0, 3)
plt.bar(edges1[:-1], hist1 / param, width=binwidth, color='none', edgecolor='r')
plt.show()
(notice the normalization: hist1 / param) which produces this:
I can generate a histtype='step' histogram using:
plt.hist(x1, bins=bin_n, histtype='step', color='r')
and get:
but then it wouldn't be normalized by the param value.

The step plot will generate the appearance that you want from a set of bins and the count (or normalized count) in those bins. Here I've used plt.hist to get the counts, then plot them, with the counts normalized. It's necessary to duplicate the first entry in order to get it to actually have a line there.
(a,b,c) = plt.hist(x1, bins=bin_n, histtype='step', color='r')
a = np.append(a[0],a[:])
plt.close()
plt.step(b, a/param, color='r')
This is not quite right, because it doesn't finish the plot correctly: the end of the line hangs in free space rather than dropping down to the x axis.
You can fix that by adding a 0 to the end of 'a' and one more bin point to 'b':
a = np.append(a[:], 0)
b = np.append(b, (2*b[-1] - b[-2]))
plt.step(b, a/param, color='r')
Lastly, the ax.step mentioned would be used if you had used
fig, ax = plt.subplots()
to give you access to the figure and axis directly. For examples, see http://matplotlib.org/examples/ticks_and_spines/spines_demo_bounds.html

Based on tcaswell's comment (use step) I've developed my own answer. Notice that I need to add elements to both the x (one zero element at the beginning of the array) and y arrays (one zero element at the beginning and another at the end of the array) so that step will plot the vertical lines at the beginning and the end of the bars.
Here's the code:
import matplotlib.pyplot as plt
import numpy as np
def rand_data():
    return np.random.uniform(low=10., high=20., size=(5000,))
# Generate data.
x1 = rand_data()
# Define histogram params.
binwidth = 0.25
x_min, x_max = x1.min(), x1.max()
bin_n = np.arange(int(x_min), int(x_max + binwidth), binwidth)
# Obtain histogram.
hist1, edges1 = np.histogram(x1, bins=bin_n)
# Normalization parameter.
param = 5.
# Create arrays adding elements so plt.bar will plot the first and last
# vertical bars.
x2 = np.concatenate((np.array([0.]), edges1))
y2 = np.concatenate((np.array([0.]), (hist1 / param), np.array([0.])))
# Plot histogram normalized by the parameter defined above.
plt.xlim(min(edges1) - (min(edges1) / 10.), max(edges1) + (min(edges1) / 10.))
plt.bar(x2, y2, width=binwidth, color='none', edgecolor='b')
plt.step(x2, y2, where='post', color='r', ls='--')
plt.show()
and here's the result:
The red lines generated by step are equal to those blue lines generated by bar as can be seen.
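For comparison, here is a sketch of a different route that neither answer above uses: giving every sample a weight of 1/param, so that plt.hist itself produces counts divided by param and histtype='step' can be used directly. The helper names are made up for illustration.

```python
import numpy as np

def normalized_counts(data, bins, param):
    """Histogram in which every sample counts as 1/param instead of 1."""
    weights = np.full(len(data), 1.0 / param)
    return np.histogram(data, bins=bins, weights=weights)

def plot_normalized_step(data, bins, param, ax=None, **kwargs):
    """Same normalization, drawn with the built-in 'step' histogram style."""
    # Imported here so normalized_counts can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    if ax is None:
        ax = plt.gca()
    weights = np.full(len(data), 1.0 / param)
    return ax.hist(data, bins=bins, weights=weights, histtype="step", **kwargs)

# Same setup as in the question, with a seeded generator for reproducibility.
rng = np.random.default_rng(0)
x1 = rng.uniform(low=10.0, high=20.0, size=200)
binwidth = 0.25
bin_n = np.arange(int(x1.min()), int(x1.max() + binwidth), binwidth)
param = 5.0
heights, edges = normalized_counts(x1, bin_n, param)
```

Calling plot_normalized_step(x1, bin_n, param, color='r') followed by plt.show() draws the outline, including the vertical lines at both ends, without padding the arrays by hand.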
