Matplotlib slow when plotting pre-cached data into many subplots - python

Although there are many matplotlib optimization posts around, I didn't find the exact tips I need, such as:
Matplotlib slow with large data sets, how to enable decimation?
Matplotlib - Fast way to create many subplots?
My problem is that I have cached CSV files of time-series data (40 of them).
I'd like to plot them in one plot with 40 subplots in a vertical series, and output them to a single rasterized image.
My code using matplotlib is as follows:
def _Draw(self):
    """Output a graph of subplots."""
    BigFont = 10
    # Prepare subplots.
    nFiles = len(self.inFiles)
    fig = plt.figure()
    plt.axis('off')
    for i, f in enumerate(self.inFiles[0:3]):
        pltTitle = '{}:{}'.format(i, f)
        colorFile = self._GenerateOutpath(f, '_rgb.csv')
        data = np.loadtxt(colorFile, delimiter=Separator)
        nRows = data.shape[0]
        ind = np.arange(nRows)
        vals = np.ones((nRows, 1))
        ax = fig.add_subplot(nFiles, 1, i + 1)
        ax.set_title(pltTitle, fontsize=BigFont, loc='left')
        ax.axis('off')
        ax.bar(ind, vals, width=1.0, edgecolor='none', color=data)
    figout = plt.gcf()
    plt.savefig(self.args.outFile, dpi=300, bbox_inches='tight')
The script hangs for the whole night. My data files are each a ~10,000 x 3 to ~30,000 x 3 matrix.
In my case, I don't think memory-mapped files would help, because the subplots seem to be the problem here, not the data imported in each loop.
I have no idea where to start optimizing this workflow.
I could, however, forget about subplots, generate one image per dataset, and stitch the 40 images together later, but that is not ideal.
Is there an easy way in matplotlib to do this?

Your problem is the way you're plotting your data.
Using bar to plot tens of thousands of bars of exactly the same size is very inefficient compared to using imshow to accomplish the same thing.
For example:
import numpy as np
import matplotlib.pyplot as plt
# Random r,g,b data similar to what you seem to be loading in....
data = np.random.random((30000, 3))
# Make data a 1 x size x 3 array
data = data[None, ...]
# Plotting using `imshow` instead of `bar` will be _much_ faster.
fig, ax = plt.subplots()
ax.imshow(data, interpolation='nearest', aspect='auto')
plt.show()
This should be essentially equivalent to what you're currently doing, but will draw much faster and use less memory.
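Applied to the original 40-file layout, the same trick might look like the sketch below. The datasets list is a stand-in for the cached CSVs (names and sizes are assumptions), but the per-axes imshow call is the actual speedup:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; suits batch output to a PNG
import matplotlib.pyplot as plt

# Stand-in for the 40 cached RGB arrays normally loaded with np.loadtxt.
datasets = [np.random.random((10000, 3)) for _ in range(40)]

fig, axes = plt.subplots(len(datasets), 1, figsize=(8, 20))
for i, (ax, data) in enumerate(zip(axes, datasets)):
    # Reshape each (n, 3) RGB array into a 1 x n x 3 image strip.
    ax.imshow(data[None, ...], interpolation='nearest', aspect='auto')
    ax.set_title(str(i), fontsize=10, loc='left')
    ax.axis('off')
fig.savefig('color_strips.png', dpi=150, bbox_inches='tight')
```

Each strip is drawn as a single image artist instead of ~10,000 bar patches, which is where the time goes.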

Related

Linearly scale axes from kilometers to meters for all plots in matplotlib

I am working with data in meters and want to plot positions. Having the ticks in meters obscures the readability of the plots, so I want to plot the data in kilometers. I know that it is possible to scale all data by d/1000.0; however, this makes the code less readable in my eyes, especially if you're plotting many different lines and have to apply this transformation every time.
I am looking for a general way to achieve this type of transformation, I could imagine there is a beautiful way to achieve this in matplotlib.
Some sample code for you:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(7500, 30000, 300)
y_ref = np.linspace(5000, 15000, 300)
y_noised = y_ref + np.random.normal(0, 250, size=y_ref.size)
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(1, 1, 1)
ax.plot(x, y_ref, c='r')
ax.scatter(x, y_noised, alpha=0.2)
plt.show()
I would like to have the following figure, without needing to scale x, y_ref and y_noised individually by 1000.
Is there a way to perform this transformation in matplotlib, such that you only need to do it for each figure once, no matter how many lines you plot?
You could use a custom tick formatter like this (passing a function into set_major_formatter creates a FuncFormatter):
m2km = lambda x, _: f'{x/1000:g}'
ax.xaxis.set_major_formatter(m2km)
ax.yaxis.set_major_formatter(m2km)
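Putting that together with the question's sample data gives a self-contained sketch (passing a bare callable requires matplotlib 3.1 or newer; on older versions, wrap it in ticker.FuncFormatter explicitly):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the script runs headless
import matplotlib.pyplot as plt

x = np.linspace(7500, 30000, 300)
y_ref = np.linspace(5000, 15000, 300)
y_noised = y_ref + np.random.normal(0, 250, size=y_ref.size)

fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(x, y_ref, c='r')
ax.scatter(x, y_noised, alpha=0.2)

# The data stay in meters; only the tick labels are divided by 1000.
m2km = lambda val, _: f'{val/1000:g}'
ax.xaxis.set_major_formatter(m2km)
ax.yaxis.set_major_formatter(m2km)
ax.set_xlabel('x [km]')
ax.set_ylabel('y [km]')
fig.savefig('km_plot.png')
```

The two formatter calls are the only per-figure step; every line added to the axes afterwards is labeled in kilometers automatically.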

Automatically determine plot size matplotlib [duplicate]

This question already has an answer here:
Inconsistent figsize resizing in matplotlib
(1 answer)
Closed 3 years ago.
I am trying to generate a bar chart with lots of bars. If I keep the figsize at defaults, the data is squeezed together and the plot is unusable.
I have the following code snippet to reproduce my problem:
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(1)
ax = fig.add_subplot(111)
N=100
# Example data
labels = [chr(x) for x in range(N)]
y_pos = np.arange(len(labels))
performance = 3 + 10 * np.random.rand(len(labels))
error = np.random.rand(len(labels))
ax.barh(y_pos, performance, xerr=error, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(labels)
ax.set_xlabel('Performance')
ax.set_title('How fast do you want to go today?')
plt.savefig('a.png', bbox_inches='tight')
plt.show()
If I manually set the height of the figure (for example figsize=(8,N*0.2)), the plotted data looks nice, but there is annoying vertical whitespace before the first bar and after the last one.
Is there any way to automatically size the plot properly?
One thing I've used is
plt.tight_layout()
It reduces the whitespace between subplots, but it can also help with just one plot. Here's more info:
https://matplotlib.org/users/tight_layout_guide.html
Another thing that may work is aspect='auto' when showing the plot:
plt.imshow(X, aspect='auto')
Yet another option is a 'scaled' axis:
plt.axis('scaled')  # this line fits your image to the screen
Also, if you mean the overall figure size, I generally just pick a generic size that fits, say 15 x 10 on a laptop screen or 30 x 20 on a monitor. Guess and test.
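Another option is to derive the figure height from the number of bars and trim the vertical padding with ax.margins. A sketch on the question's setup; the 0.2 inches per bar is an assumed rule of thumb, not a matplotlib default:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

N = 100
labels = ['item {}'.format(i) for i in range(N)]  # placeholder labels
y_pos = np.arange(N)
performance = 3 + 10 * np.random.rand(N)

# Grow the figure with the bar count instead of hard-coding figsize.
fig, ax = plt.subplots(figsize=(8, max(2.0, N * 0.2)))
ax.barh(y_pos, performance, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(labels)
ax.set_xlabel('Performance')
ax.margins(y=0.005)  # shrink the whitespace before the first and after the last bar
fig.savefig('bars.png', bbox_inches='tight')
```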

Scale colormap for contour and contourf

I'm trying to plot the contour map of a given function f(x,y), but since the function's output grows really fast, I'm losing a lot of information at lower values of x and y. I found on the forums that vmax=vmax can work around this, and it does, but only for a specific range of x and y and number of colormap levels.
Say I have this plot:
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
u = np.linspace(-2,2,1000)
x,y = np.meshgrid(u,u)
z = (1-x)**2+100*(y-x**2)**2
cont = plt.contour(x,y,z,500,colors='black',linewidths=.3)
cont = plt.contourf(x,y,z,500,cmap="jet",vmax=100)
plt.colorbar(cont)
plt.show()
I want to uncover what's beyond the axis limits while keeping the same scale, but if I change the x and y limits to -3 and 3 I get:
See how I lost most of my levels, since the function's maximum value at these limits is much higher. A workaround is to increase the number of levels to 1000, but that takes a lot of computational time.
Is there a way to plot only the contour levels that I need? That is, between 0 and 100.
An example of a desired output would be:
With the white space being the continuation of the plot without resizing the levels.
The code I'm using is the one given after the first image.
There are a few possible ideas here. The one I very much prefer is a logarithmic representation of the data. An example would be
from matplotlib import ticker
fig = plt.figure(1)
cont1 = plt.contourf(x,y,z,cmap="jet",locator=ticker.LogLocator(numticks=10))
plt.colorbar(cont1)
plt.show()
fig = plt.figure(2)
cont2 = plt.contourf(x,y,np.log10(z),100,cmap="jet")
plt.colorbar(cont2)
plt.show()
The first example uses matplotlib's LogLocator. The second one just computes the logarithm of the data directly and plots that normally.
The third example just caps all data above 100.
fig = plt.figure(3)
zcapped = z.copy()
zcapped[zcapped>100]=100
cont3 = plt.contourf(x,y,zcapped,100,cmap="jet")
cbar = plt.colorbar(cont3)
plt.show()
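A fourth option, closer to plotting "only the contour levels I need", is to pass an explicit levels array for the 0-100 band and let extend='max' lump everything above it into the top color. A sketch using the question's own function:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

u = np.linspace(-3, 3, 500)
x, y = np.meshgrid(u, u)
z = (1 - x)**2 + 100 * (y - x**2)**2  # the function from the question

# 50 filled bands between 0 and 100; values above 100 share the top color.
levels = np.linspace(0, 100, 51)
cont = plt.contourf(x, y, z, levels=levels, cmap='jet', extend='max')
plt.colorbar(cont)
plt.savefig('contour_capped.png')
```

This avoids both the 1000-level cost and the manual capping of the data array.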

An empty pdf is created when a figure with subplots is saved as pdf from matplotlib in python3.2

I need to save a figure (with 8 subplots on it) generated from matplotlib in python3.2, on a single pdf page.
Each subplot may have 240k to 400k data points.
My code:
from matplotlib.backends.backend_pdf import PdfPages
plt.show(block=False)
pp = PdfPages('multipage.pdf')
fig = plt.figure()
fig.savefig('figure_1.pdf', dpi = fig.dpi)
pp.close()
But, only an empty pdf file was created and no figures on it.
Any help would be appreciated.
UPDATE
This is a demo code:
def plot_pdf_example():
    fig = plt.figure()
    # I create subplots here
    #x = np.random.rand(50)
    #y = np.random.rand(50)
    plt.plot(x, y, '.')
    fig.savefig('figure_b.pdf')

if __name__ == '__main__':
    r = plot_pdf_example()
    # the return value r is not 0 in my case
    print("done")
If I use plt.show() to get the figure in a pop-up window, there are some title and legend overlaps between subplots. How do I adjust the figure so that I get all subplots without any overlaps while also keeping every subplot square? Keeping them square is very important for me.
Your code does save the single and empty figure fig to the file figure_1.pdf, without making any use of PdfPages. It is also normal that the pdf file is empty, since you are not plotting anything in fig. Below is a MWE that shows how to save only one figure to a single pdf file. I've removed all the stuff with PdfPages that was not necessary.
Update (2015-07-27): When there is some problem saving a fig to pdf because there is too much data to render or in the cases of complex and detailed colormaps, it may be a good idea to rasterize some of the elements of the plot that are problematic. The MWE below has been updated to reflect that.
import matplotlib.pyplot as plt
import numpy as np
import time
plt.close("all")
fig = plt.figure()
N = 400000
x = np.random.rand(400000)
y = np.random.rand(400000)
colors = np.random.rand(400000)
area = 3
ax0 = fig.add_axes([0.1, 0.1, 0.85, 0.85])
scater = ax0.scatter(x, y, s=area, c=colors)
scater.set_rasterized(True)
plt.show(block=False)
ts = time.perf_counter()
fig.savefig('figure_1.pdf')
te = time.perf_counter()
print('t = %f sec' % (te - ts))
On my machine, the code above took about 6.5 sec to save the pdf when rasterized was set to true for scater, while it took 61.5 sec when it was set to False.
By default, when saving to pdf, the figure is saved in a vector format. This means that every point is stored as a set of parameters (color, size, position, etc.). That is a lot of information to store when there is a lot of data (8 * 400k in the case of the OP). When some elements of the plot are converted to a raster format, the number of points plotted no longer matters, because the image is saved as a fixed number of pixels instead (as in a png). By rasterizing only the scater artist, the rest of the figure (axes, labels, text, legend, etc.) remains in vector format. Overall, the loss in quality is not very noticeable for some types of graph (like colormaps or scatter plots), but it will be for graphs with sharp lines.
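Note that rasterization can also be requested at artist creation with the rasterized=True keyword. To address the update about overlapping titles and square subplots, a sketch combining that with constrained_layout and set_box_aspect (the latter requires matplotlib 3.3+) might look like:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# 2 x 4 grid standing in for the 8 subplots; constrained_layout spaces
# titles and labels so they do not overlap.
fig, axes = plt.subplots(2, 4, figsize=(14, 7), constrained_layout=True)
for i, ax in enumerate(axes.flat):
    x = np.random.rand(1000)
    y = np.random.rand(1000)
    ax.scatter(x, y, s=3, rasterized=True)  # shorthand for set_rasterized(True)
    ax.set_title('subplot {}'.format(i))
    ax.set_box_aspect(1)  # keep each axes box square (matplotlib >= 3.3)
fig.savefig('eight_subplots.pdf', dpi=200)
```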

how to change the colors of multiple subplots at once?

I am looping through a bunch of CSV files containing various measurements.
Each file might be from one of 4 different data sources.
In each file, I merge the data into monthly datasets, that I then plot in a 3x4 grid. After this plot has been saved, the loop moves on and does the same to the next file.
This part I got figured out, however I would like to add a visual clue to the plots, as to what data it is. As far as I understand it (and tried it)
plt.subplot(4,3,1)
plt.hist(Jan_Data,facecolor='Red')
plt.ylabel('value count')
plt.title('January')
does work; however, this way I would have to add facecolor='Red' by hand to every one of the 12 subplots. Looping through the plots won't work for this situation, since I want the ylabel only for the leftmost plots and the xlabels only for the bottom row.
Setting facecolor at the beginning in
fig = plt.figure(figsize=(20,15),facecolor='Red')
does not work, since it only changes the background color of the 20 by 15 figure now, which subsequently gets ignored when I save it to a PNG, since it only gets set for screen output.
So is there just a simple setthecolorofallbars='Red' command for plt.hist(… or plt.savefig(… I am missing, or should I just copy n' paste it to all twelve months?
You can use mpl.rc("axes", prop_cycle=mpl.cycler(color=["red"])) to set the default color cycle for all your axes (color_cycle was the older name of this rcParam; it was removed in matplotlib 2.0).
In this little toy example, I use the with mpl.rc_context block to limit the effects of mpl.rc to just the block. This way you don't spoil the default parameters for your whole session.
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)

# create some toy data
n, m = 2, 2
data = []
for i in range(n * m):
    data.append(np.random.rand(30))

# and do the plotting
with mpl.rc_context():
    mpl.rc("axes", prop_cycle=mpl.cycler(color=["red"]))
    fig, axes = plt.subplots(n, m, figsize=(8, 8))
    for ax, d in zip(axes.flat, data):
        ax.hist(d)
The problem with the x- and y-labels (when you use loops) can be solved by using plt.subplots, as you can access every axis separately.
import matplotlib.pyplot as plt
import numpy as np

# creating figure with 4 plots
fig, ax = plt.subplots(2, 2)
# some data
data = np.random.randn(4, 1000)
# some titles
title = ['Jan', 'Feb', 'Mar', 'April']
xlabel = ['xlabel1', 'xlabel2']
ylabel = ['ylabel1', 'ylabel2']

for i in range(ax.size):
    a = ax[i // 2, i % 2]  # integer division for the row index
    a.hist(data[i], facecolor='r', bins=50)
    a.set_title(title[i])

# write the ylabels on all axes on the left hand side
for j in range(ax.shape[0]):
    ax[j, 0].set_ylabel(ylabel[j])

# write the xlabels on all axes at the bottom
for j in range(ax.shape[1]):
    ax[-1, j].set_xlabel(xlabel[j])

fig.tight_layout()
All features (like titles) which are not constant can be put into arrays and placed at the appropriate axis.
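For the specific left-column/bottom-row label pattern, ax.label_outer() together with sharex/sharey can replace the manual index arithmetic: set the labels everywhere and let matplotlib hide the inner ones. A sketch with red histograms in the question's 3x4 layout (month names and bin count are assumptions):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

data = np.random.randn(12, 1000)  # stand-in for the 12 monthly datasets
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

fig, axes = plt.subplots(4, 3, figsize=(12, 15), sharex=True, sharey=True)
for ax, d, month in zip(axes.flat, data, months):
    ax.hist(d, facecolor='red', bins=50)  # one color keyword covers every subplot
    ax.set_title(month)
    ax.set_xlabel('value')
    ax.set_ylabel('value count')
    ax.label_outer()  # hides labels except on the left column and bottom row
fig.savefig('monthly_hists.png')
```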
