Python: scatter plot with non-linear x axis - python

I have data with lots of x values around zero and only a few as you go up to around 950,
I want to create a plot with a non-linear x axis so that the relationship can be seen in a 'straight line' form. Like seen in this example,
I have tried using plt.xscale('log') but it does not achieve what I want.
I have not been able to use the log scale function with a scatter plot as it then only shows 3 values rather than the thousands that exist.
I have tried to work around it using
plt.plot(retper, aep_NW[y], marker='o', linewidth=0)
to replicate the scatter function which plots but does not show what I want.
plt.figure(1)
plt.scatter(rp,aep,label="SSI sum")
plt.show()
Image 3:
plt.figure(3)
plt.scatter(rp, aep)
plt.xscale('log')
plt.show()
Image 4:
plt.figure(4)
plt.plot(rp, aep, marker='o', linewidth=0)
plt.xscale('log')
plt.show()
ADDITION:
Hi thank you for the response.
I think you are right that my x axis is truncated but I'm not sure why or how...
I'm not really sure what to post code wise as the data is all large and coming from a server so can't really give you the data to see it with.
Basically aep_NW is a one dimensional array with 951 elements, values from 0-~140, with most values being small and only a few larger values. The data represents a storm severity index for 951 years.
Then I want the x axis to be the return period for these values, so basically I made a rp array, of the same size, which is given values from 951 down decreasing my a half each time.
I then sort the aep_NW values from lowest to highest with the highest value being associated with the largest return value (951), then the second highest aep_NW value associated with the second largest return period value (475.5) ect.
So then when I plot it I need the x axis scale to be similar to the example you showed above or the first image I attatched originally.
rp = [0]*numseas.shape[0]
i = numseas.shape[0] - 1
rp[i] = numseas.shape[0]
i = i - 1
while i != 0:
rp[i] = rp[i+1]/2
i = i - 1
y = np.argsort(aep_NW)
fig, ax = plt.subplots()
ax.scatter(rp,aep_NW[y],label="SSI sum")
ax.set_xscale('log')
ax.set_xlabel("Return period")
ax.set_ylabel("SSI score")
plt.title("AEP for NW Europe: total loss per entire extended winter season")
plt.show()

It looks like in your "Image 3" the x axis is truncated, so that you don't see the data you are interested in. It appears this is due to there being 0's in your 'rp' array. I updated the examples to show the error you are seeing, one way to exclude the zeros, and one way to clip them and show them on a different scale.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
n = 100
numseas = np.logspace(-5, 3, n)
aep_NW = np.linspace(0, 140, n)
rp = [0]*numseas.shape[0]
i = numseas.shape[0] - 1
rp[i] = numseas.shape[0]
i = i - 1
while i != 0:
rp[i] = rp[i+1] /2
i = i - 1
y = np.argsort(aep_NW)
fig, axes = plt.subplots(1, 3, figsize=(14, 5))
ax = axes[0]
ax.scatter(rp, aep_NW[y], label="SSI sum")
ax.set_xscale('log')
ax.set_xlabel("Return period")
ax.set_ylabel("SSI score")
ax = axes[1]
rp = np.array(rp)[y]
mask = rp > 0
ax.scatter(rp[mask], aep_NW[y][mask], label="SSI sum")
ax.set_xscale('log')
ax.set_xlabel("Return period (0 values excluded)")
ax = axes[2]
log2_clipped_rp = np.log2(rp.clip(2**-100, None))[y]
ax.scatter(log2_clipped_rp, aep_NW[y], label="SSI sum")
xticks = list(range(-110, 11, 20))
xticklabels = [f'$2^{{{i}}}$' for i in xticks]
ax.set_xticks(xticks)
ax.set_xticklabels(xticklabels)
ax.set_xlabel("log$_2$ Return period (values clipped to 2$^{-100}$)")
plt.show()

Related

Adding histogram bins together and plotting a figure

I have a histogram with 8192 bins, each bin imported from a line from a text file. To cut things short, it makes an awful fit and it was suggested to mee I could reduce the statistical errors by adding counts from adjacent bins. e.g. add bins 0-7 to make a new first bin, 8 times as wide, but 8 times(roughly) as high.
Ideally, would like to just be able to output a histogram of a binwidth controlled by a single constant in the code. However my attempts to do this, instead of producing something like the first image below (which was born of the version of my code which can only do a binwidth of 1, produce something like the second image below, missing fit lines and with a second empty graph in the same image file (born of my attempts to generalise the code for any bin width).
The following a histogram plotted directly from the original data i.e. binwidth = 1
Original code output, only works for binwidth 1 though.
Example for trying bin width 8 with come code modifications
I also need it to return a fit report, and the area under the gaussian as this is plotted later on in the code, in an exponential decay curve.
Here is the section of code I think is relevant:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from numpy import exp, loadtxt, pi, sqrt, random, linspace
from lmfit import Model
import glob, os
## Load text file
x = np.linspace(0, 8191, 8192)
finalprefix = str(n).zfill(3)
fullprefix = folderToAnalyze + prefix + finalprefix
y = loadtxt(fullprefix + ".Spe", skiprows= 12, max_rows = 8192)
## Make figure and label
fig, ax = plt.subplots(figsize=(15,8))
fig.suptitle('Photon coincidence detections from $β^+$ + $β^-$ annhilation', fontsize=18)
plt.xlabel('Bins', fontsize=14)
plt.ylabel('Counts', fontsize=14)
## Plot data
ax.bar(x, y)
ax.set_xlim(600,960)
## Adding Bins Together
y = y.astype(int)
x = x.astype(int)
## create the data
data = np.repeat(x, y)
## determine the range of x
x_range = range(min(data), max(data)+1)
## determine the length of x
x_len = len(x_range)
## plot
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(10, 10))
ax1.hist(data, bins=x_len) # outliers are not plotted
plt.show()
## given x_len determine how many bins for a given bin width
width = 8
bins = int(np.round(x_len / width))
## determine new x and y for the histogram
y, x = np.histogram(data, bins=bins)
## Fit data to Gaussian
gmodel = Model(gaussian)
result = gmodel.fit(y, x=x[:-1], amp=8, cen=approxcen, wid=1)
## result
print(result.fit_report())
fig.savefig("abw_" + finalprefix + ".png")
## Append to list if error in amplitude and amplitude itself is within reasonable bounds
if result.params['amp'].stderr < stderrThreshold and result.params['amp'] > minimumAmplitude:
amps.append(result.params['amp'].value)
ampserr.append(result.params['amp'].stderr)
ts.append(MaestroT*n)
## Plot decay curve
fig, ax = plt.subplots()
ax.errorbar(ts, amps, yerr= 2*np.array(ampserr), fmt="ko-", capsize = 5, capthick= 2, elinewidth=3, markersize=5)
plt.xlabel('Time', fontsize=14)
plt.ylabel('Peak amplitude', fontsize=14)
plt.title("Decay curve of P-31 by $β^+$ emission", fontsize=14)
Some synthetic data: {1,2,1,0,0,0,0,0,6,0,0,0,0,0,0,0,7,0,0,1,0,1,0,0,6,6,0,0,0,3,0,0,3,3,3,5,4,0,4,3,1,4,0,5,6,4,0,2,0,0,0,9,6,1,1,1,0,0,3,2,2,3,0,0,0,2,4,0,0,0,0,0,0,4,10,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0}
I think this should create 2 very different shaped histograms when the bin width is 1 and when it is 8. Though I have just made them up, the fit may not be good, and it is worth mentioning one of the problems I was having is related to being able to add together the information read in from the text file
In case it's useful:
-Here is the full original code
-Here is the data for that histogram

Plotting discrete X/Y values over Seaborn heatmap

I'm trying to mark a specific X/Y coordinate on a Seaborn heatmap. The output is plotting the correct X coordinate but is plotting the Y as zero/axis minimum. Code and its heatmap output is shown below.
speeds = pd.read_csv('pytest20140730.csv')
speedspiv = pd.pivot_table(speeds, values = 'speed', index = 'road_order',
columns = 'tod_value', aggfunc = np.mean)
speedspiv = speedspiv.sort_index(axis=0, ascending = False)
plt.subplots(figsize=(20,15))
ax = sns.heatmap(speedspiv)
ax.scatter(414, 0.244444444, marker='*', s=100, color='yellow')
See yellow star on x-axis of the following image
What's up with the Y axis value not plotting correctly?
A seaborn heatmap produces a plot with x- and y-axes, which are index based, not data based. That means, even if labeled with 382...453 the y-axis goes from 0 to 71.
To test this, just print the result of ax.get_xlim() and ax.get_ylim().
For your Problem to put a marker at a meaningful place in the heatmap, this means, that you'll have to compute the correct (fractional) indices of the x- and y- value of the marker and use these in the scatter plot:
x_vals = np.linspace(0, 1, 1500) # I don't know the real range of you x-data
x_idx = np.interp(.24, x_vals, range(len(x_vals)))
y_vals = np.arange(382, 453)
y_idx = np.interp(414, y_vals, range(len(y_vals)))
In: x_idx, y_idx
Out: (359.76, 32.0)

Plot histogram normalized by fixed parameter

I need to plot a plot a normalized histogram (by normalized I mean divided by a fixed value) using the histtype='step' style.
The issue is that plot.bar() doesn't seem to support that style and if I use instead plot.hist() which does, I can't (or at least don't know how) plot the normalized histogram.
Here's a MWE of what I mean:
import matplotlib.pyplot as plt
import numpy as np
def rand_data():
return np.random.uniform(low=10., high=20., size=(200,))
# Generate data.
x1 = rand_data()
# Define histogram params.
binwidth = 0.25
x_min, x_max = x1.min(), x1.max()
bin_n = np.arange(int(x_min), int(x_max + binwidth), binwidth)
# Obtain histogram.
hist1, edges1 = np.histogram(x1, bins=bin_n)
# Normalization parameter.
param = 5.
# Plot histogram normalized by the parameter defined above.
plt.ylim(0, 3)
plt.bar(edges1[:-1], hist1 / param, width=binwidth, color='none', edgecolor='r')
plt.show()
(notice the normalization: hist1 / param) which produces this:
I can generate a histtype='step' histogram using:
plt.hist(x1, bins=bin_n, histtype='step', color='r')
and get:
but then it wouldn't be normalized by the param value.
The step plot will generate the appearance that you want from a set of bins and the count (or normalized count) in those bins. Here I've used plt.hist to get the counts, then plot them, with the counts normalized. It's necessary to duplicate the first entry in order to get it to actually have a line there.
(a,b,c) = plt.hist(x1, bins=bin_n, histtype='step', color='r')
a = np.append(a[0],a[:])
plt.close()
step(b,a/param,color='r')
This is not quite right, because it doesn't finish the plot correctly. the end of the line is hanging in free space rather than dropping down the x axis.
you can fix that by adding a 0 to the end of 'a' and one more bin point to b
a=np.append(a[:],0)
b=np.append(b,(2*b[-1]-b[-2]))
step(b,a/param,color='r')
lastly, the ax.step mentioned would be used if you had used
fig, ax = plt.subplots()
to give you access to the figure and axis directly. For examples, see http://matplotlib.org/examples/ticks_and_spines/spines_demo_bounds.html
Based on tcaswell's comment (use step) I've developed my own answer. Notice that I need to add elements to both the x (one zero element at the beginning of the array) and y arrays (one zero element at the beginning and another at the end of the array) so that step will plot the vertical lines at the beginning and the end of the bars.
Here's the code:
import matplotlib.pyplot as plt
import numpy as np
def rand_data():
return np.random.uniform(low=10., high=20., size=(5000,))
# Generate data.
x1 = rand_data()
# Define histogram params.
binwidth = 0.25
x_min, x_max = x1.min(), x1.max()
bin_n = np.arange(int(x_min), int(x_max + binwidth), binwidth)
# Obtain histogram.
hist1, edges1 = np.histogram(x1, bins=bin_n)
# Normalization parameter.
param = 5.
# Create arrays adding elements so plt.bar will plot the first and last
# vertical bars.
x2 = np.concatenate((np.array([0.]), edges1))
y2 = np.concatenate((np.array([0.]), (hist1 / param), np.array([0.])))
# Plot histogram normalized by the parameter defined above.
plt.xlim(min(edges1) - (min(edges1) / 10.), max(edges1) + (min(edges1) / 10.))
plt.bar(x2, y2, width=binwidth, color='none', edgecolor='b')
plt.step(x2, y2, where='post', color='r', ls='--')
plt.show()
and here's the result:
The red lines generated by step are equal to those blue lines generated by bar as can be seen.

multi colored plots in matplotlib plt based on certain properties of data

I want to color the line in a plot based on the following of a data set on the y axis.
if data > 0:
color = 'r'
if data = 0:
color = 'g'
if data < 0:
color = 'b'
Unfortunately I only know how to color the entire data set one color. I also couldn't find anything on the web. I'm assuming there is a way to do this without breaking up the dataset for every time the color changes.
Below is an example of plotting the data with just one color.
import matplotlib.pyplot as plt
import numpy as np
# Simple data
x = np.linspace(0, 2 * np.pi, 400)
data = np.sin(x ** 2)
#plot
f, ax = plt.subplots()
ax.plot(x, data, color='r')
plt.show()
The color parameter actually can take a list as an argument. For example, here's a simple bit of code that sets up a list of colors based on whether the data is positive or negative:
colors = []
for item in data:
if item < 0:
colors.append('r')
else:
colors.append('g')
then simply:
ax.bar(x, data, color=colors)
Edit: So I tested it, and it appears that my answer is only applicable for bar graphs. I couldn't find anything in the matplotlib documentation that seemed to indicate that coloring a line plot with multiple colors was possible. I did, however find this site, which I believe has the information you want. The guy there defines his own function to achieve it.
Using the file at my link, here is an equivalent version for a line graph:
cmap = ListedColormap(['r', 'g']) # use the colors red and green
norm = BoundaryNorm([-1000,0,1000], cmap.N) # map red to negative and green to positive
# this may work with just 0 in the list
fig, axes = plt.subplots()
colorline(x, data, data, cmap=cmap, norm=norm)
plt.xlim(x.min(), x.max())
plt.ylim(data.min(), data.max())
plt.show()
The last three arguments of colorline here tell it the color data and how to map it.

How can I plot ca. 20 million points as a scatterplot?

I am trying to create a scatterplot with matplotlib that consists of ca. ca. 20 million data points. Even after setting the alpha value to its lowest before ending up with no visible data at all the result is just a completely black plot.
plt.scatter(timedPlotData, plotData, alpha=0.01, marker='.')
The x-axis is a continuous timeline of about 2 months and the y-axis consists of 150k consecutive integer values.
Is there any way to plot all the points so that their distribution over time is still visible?
Thank you for your help.
There's more than one way to do this. A lot of folks have suggested a heatmap/kernel-density-estimate/2d-histogram. #Bucky suggesed using a moving average. In addition, you can fill between a moving min and moving max, and plot the moving mean over the top. I often call this a "chunkplot", but that's a terrible name. The implementation below assumes that your time (x) values are monotonically increasing. If they're not, it's simple enough to sort y by x before "chunking" in the chunkplot function.
Here are a couple of different ideas. Which is best will depend on what you want to emphasize in the plot. Note that this will be rather slow to run, but that's mostly due to the scatterplot. The other plotting styles are much faster.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
np.random.seed(1977)
def main():
x, y = generate_data()
fig, axes = plt.subplots(nrows=3, sharex=True)
for ax in axes.flat:
ax.xaxis_date()
fig.autofmt_xdate()
axes[0].set_title('Scatterplot of all data')
axes[0].scatter(x, y, marker='.')
axes[1].set_title('"Chunk" plot of data')
chunkplot(x, y, chunksize=1000, ax=axes[1],
edgecolor='none', alpha=0.5, color='gray')
axes[2].set_title('Hexbin plot of data')
axes[2].hexbin(x, y)
plt.show()
def generate_data():
# Generate a very noisy but interesting timeseries
x = mdates.drange(dt.datetime(2010, 1, 1), dt.datetime(2013, 9, 1),
dt.timedelta(minutes=10))
num = x.size
y = np.random.random(num) - 0.5
y.cumsum(out=y)
y += 0.5 * y.max() * np.random.random(num)
return x, y
def chunkplot(x, y, chunksize, ax=None, line_kwargs=None, **kwargs):
if ax is None:
ax = plt.gca()
if line_kwargs is None:
line_kwargs = {}
# Wrap the array into a 2D array of chunks, truncating the last chunk if
# chunksize isn't an even divisor of the total size.
# (This part won't use _any_ additional memory)
numchunks = y.size // chunksize
ychunks = y[:chunksize*numchunks].reshape((-1, chunksize))
xchunks = x[:chunksize*numchunks].reshape((-1, chunksize))
# Calculate the max, min, and means of chunksize-element chunks...
max_env = ychunks.max(axis=1)
min_env = ychunks.min(axis=1)
ycenters = ychunks.mean(axis=1)
xcenters = xchunks.mean(axis=1)
# Now plot the bounds and the mean...
fill = ax.fill_between(xcenters, min_env, max_env, **kwargs)
line = ax.plot(xcenters, ycenters, **line_kwargs)[0]
return fill, line
main()
For each day, tally up the frequency of each value (a collections.Counter will do this nicely), then plot a heatmap of the values, one per day. For publication, use a grayscale for the heatmap colors.
My recommendation would be to use a sorting and moving average algorithm on the raw data before you plot it. This should leave the mean and trend intact over the time period of interest while providing you with a reduction in clutter on the plot.
Group values into bands on each day and use a 3d histogram of count, value band, day.
That way you can get the number of occurrences in a given band on each day clearly.

Categories

Resources