Heat maps of large data with pyplot slow

Heat maps of large data with pyplot slow - python

I am working with a big set of data (240, 131000) and bigger. I am currently using the code below to plot this.
fig,ax = pyplot.subplots()
spectrum = ax.pcolor(waterfallplot, cmap='viridis')
pyplot.colorbar()
pyplot.show()
However, it's taking a very long time (30 min+) and the plot still hasn't shown up yet. A quick breakpoint check says the code gets to the spectrum= line, but doesn't go past. Looking at the memory on my computer, it hasn't even gotten close to the limit.
Does anyone have a better way of doing this?

pcolorfast works best for large arrays and updates quickly.

Related

How can I improve the look of scipy's spectrograms?

I need to generate spectrograms for audio files with Python and I'm following the solution given here. However, the spectrograms I'm getting don't look very "populated," and not at all like other spectrograms I get from other software.
This is the code I used for the particular image I'm showing here:
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy import signal
from scipy.io import wavfile
sample_rate, samples = wavfile.read('audio-mono.wav')
frequencies, times, spectrogram = signal.spectrogram(samples[:700000], sample_rate)
cMap = cm.get_cmap('gray', 3000) # Maybe I'm not understanding this very well
fig = plt.figure(figsize=(4,2), dpi=400, frameon=False)
plt.pcolormesh(times, frequencies, spectrogram, cmap=cMap)
plt.savefig('spectrogram.png')
The following images are spectrograms from Audacity and Aegisub, respectively, both for the same file for which the third image's spectrogram was created (with scipy).
To create this spectrogram, trying to see if it was a figure-size/resolution issue, I tried a couple of things things, one by one, and the end result is this image (with both of them applied).
First, when extracting the .wav file from the .mp4 file, I set the sampling rate to 10 KHz to avoid having such a big y-axis in the plot and see if this helps. This is why you see a max of 5,000. I though I could live with some frequencies neglected given that I care, most of all, about speech frequencies.
Then, to get a better zoom, I created a spectrogram with only the first 700,000 elements of the samples array (see code), which, in the case of this file, represent about 70 seconds. This didn't help either. I even tried to create the spectrogram with the same slice of the samples array, but by taking only every tenth value, then every twentieth, and so on, but this only made the spectrogram have horizontal lines instead of dots. This is not applied here in the figure I'm showing you, because I realized it's far from helping. I also tinkered with the figure size and the resolution, but it didn't really help either.
As you can see in the first figure, the y-axis goes from 0 to 5 KHz, and many frequencies have some intensity at that level. Also, the only moment in that 70-second span with complete silence is around the 35-second mark. The accuracy of this becomes obvious when listening to the file.
In the second figure, there is no y-axis mark, but I can see it has a bigger range than the 5 KHz, which I think accounts for the difference with the first figure. I'm pretty sure that, unfortunately, I can't change this view range. However, this spectrogram also shows the moment of complete silence accurately, and it is at least properly "populated" in the rest of it.
By seeing the third figure (the one I generated with scipy), one could easily think there are several parts of complete silence in those first 70 seconds, which is far from true. I'd like it to look more like those above it, because I know they're much more accurate, but I don't really know how I can do it, and this one won't work at all.
I'm pretty sure there is something I can do, but I think I still don't know scipy enough to know what it is.
Thanks in advance.
EDIT 1
PLOTTED THE SPECTROGRAM WITHOUT SPECIFYING A COLORMAP
You can see the plot looks a bit more populated, but still not even close to the other ones.
EDIT 2
Considering the idea given in the first comment of this question, I used a manipulated version of the gray colormap to have black as the first entry (as normal) but with the second entry being the color that's normally halfway, and then 2,999 colors from there up to white. Please excuse me if I'm using wrong terminology here or if this is not correctly phrased. I'm still trying to understand how to work with colormaps.
The code used to create and plot the spectrogram is the same. The only difference is the colormap used, which I manipulated as follows:
import numpy as np
from matplotlib.colors import ListedColormap
cMap = cm.get_cmap('gray', 3000)
new_colors = cMap(np.linspace(0.5, 1, 3000))
black = [0, 0, 0, 1]
new_colors[0, :] = black
new_cmp = ListedColormap(new_colors)
Using new_cmp as the colormap for the pcolormesh() function, I get the following spectrogram.
This is much, much better than the original, and looks much more like the ones from Audacity and Aegisub. However, I'd like to know if there is a better approach I can take to make my spectrograms look better, if there could be something else that's causing this to not look so much as the sample ones, and if there is a better way to do what I did with the colormap. As I said, I'm still struggling with them.
EDIT 3
I'm now sharing the audio I used to create these spectrograms here.

Making the marker size as small as possible with matplotlib.pyplot

My program can generate up to 2000000 points within 5 seconds, so I don't have a problem with speed. Right now I am using matplotlib.pyplot.scatter to scatter all my points on my graph. For s=1, it gives me small circles, but not small enough, because it does not show the intricate patterns. When I use s=0.1, it gives me this weird marker shape:
Which makes the marker larger despite me making the size smaller. I have searched all over the internet including stack overflow, but they do not tell how to minimize the size further. Unfortunately, I have to show all the points, and cannot just show a random sample of them.
I have come to the conclusion that matplotlib is made for a small sample of points and not meant for plotting millions of points. However, if it is possible to make the size smaller please let me know.
Anyway, for my points, I have all the x values in order in one array, and all the y values in order in another array. Could someone suggest a graphing package in python I could use to graph all the points in a way that the size would be very small since when I plot the points now it just becomes one big block of color instead of intricate designs forming in the shape as they should be.
Thanks for any help in advance!
EDIT: My code that I am using to scatter the points is:
plt.savefig(rootDir+"b"+str(Nvertices)+"_"+str(xscale)+"_"+str(yscale)+"_"+str(phi)+"_"+str(psi)+"_"+CurrentRun+"_color.png", dpi =600)
EDIT: I got my answer, I added linewidths = 0 and that significantly reduced the size of the points, giving me what I needed.

Perhaps you can try making the linewidths as 0 i.e., the line width of the marker edges. Notice the difference in the two plots below
fig, ax = plt.subplots(figsize=(6, 4))
plt.scatter(np.random.rand(100000), np.random.rand(100000), s=0.1)
fig, ax = plt.subplots(figsize=(6, 4))
plt.scatter(np.random.rand(100000), np.random.rand(100000), s=0.1, linewidths=0)

you can set the marker to a single pixel using marker=',' in your call to scatter.
See the markers documentation here

How can I work around overflow error in matplotlib?

I'm solving a set of coupled differential equations with odeint package from scipy.integrate.
For the integration time I have:
t=numpy.linspace(0,8e+9,5e+06)
where 5e+06 is the timestep.
I then plot the equations I have as such:
plt.xscale('symlog') #x axis logarithmic scale
plt.yscale('log',basey=2) #Y axis logarithmic scale
plt.gca().set_ylim(8, 100000) #Changing y axis ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(matplotlib.ticker.ScalarFormatter())
ax.xaxis.set_major_formatter(matplotlib.ticker.ScalarFormatter())
plt.title("Example graph")
plt.xlabel("time (yr)")
plt.ylabel("quantity a")
plt.plot(t,a,"r-", label = 'Example graph')
plt.legend(loc='best')
where a is time dependent variable. (This is just one graph from many.)
However, the graphs look a bit jagged, rather than oscillatory and I obtain this error:
OverflowError: Exceeded cell block limit (set 'agg.path.chunksize' rcparam)
I'm not overly sure what this error means, I've looked at other answers but don't know how to implement the 'agg.path.chunksize'.
Also, the integration + plotting takes around 7 hours and that is with some CPU processing hacks, so I really do not want to implement anything that would increase the time.
How can I overcome this error?
I have attempted to reduce the timestep, however I obtain this error instead:
Excess work done on this call (perhaps wrong Dfun type).
Run with full_output = 1 to get quantitative information.

As the error message suggests, you may set the chunksize to a larger value.
plt.rcParams['agg.path.chunksize'] = 1000
However you may also critically reflect why this error occurs in the first place. It would only occur if you are trying to plot an unreasonably large amount of data on the graph. Meaning, if you try to plot 200000000 points, the renderer might have problems to keep them all in memory. But one should probably ask oneself, why is it necessary to plot so many points? A screen may display some 2000 points in lateral direction, a printed paper maybe 6000. Using more points does not make sense, generally speaking.
Now if the solution of your differential equations requires a large point density, it does not automatically mean that you need to plot them all.
E.g. one could just plot every 100th point,
plt.plot(x[::100], y[::100])
most probably without even affecting the visual plot appearance.

What are ways to speed up seaborns pairplot

I have a dataframe with 250.000 rows but 140 columns and I'm trying to construct a pair plot. of the variables.
I know the number of subplots is huge, as well as the time it takes to do the plots. (I'm waiting for more than an hour on an i5 with 3,4 GHZ and 32 GB RAM).
Remebering that scikit learn allows to construct random forests in parallel, I was checking if this was possible also with seaborn.
However, I didn't find anything. The source code seems to call the matplotlib plot function for every single image.
Couldn't this be parallelised? If yes, what is a good way to start from here?

Rather than parallelizing, you could downsample your DataFrame to say, 1000 rows to get a quick peek, if the speed bottleneck is indeed occurring there. 1000 points is enough to get a general idea of what's going on, usually.
i.e. sns.pairplot(df.sample(1000)).

Save your pairplot to image and then show this image instead of rendering it all in your browser.
from IPython.display import Image
import seaborn as sns
import matplotlib.pyplot as plt
sns_plot = sns.pairplot(df, size=2.0)
sns_plot.savefig("pairplot.png")
plt.clf() # Clean parirplot figure from sns
Image(filename='pairplot.png') # Show pairplot as image

For me, I had a situation where the histograms were taking a very long time due to the variance in the data. I only had 1200 rows and 4 columns, but it took half an hour before I gave up. I think it was so spread out and unordered that the histogram was constantly updating. One workaround might be to play with the bin parameter, but my solution was to use a KDE for the diagonal instead. With the KDE, it takes only a few seconds.
sns.pairplot(df, diag_kind='kde')

Having issues with matplolib for loop getting slower

I want to loop over a series of images to see how they change over time. Thus, I want them plotted on the same figure. The following code works but seems to slow down after a few iterations. Does anyone know why this is happening, how to overcome it, or an alternative way to visualize these images over time?
fig, ax=pyplot.subplots(figsize=(8,6))
for i in range(n):
ax.imshow(imageArray[i])
fig.canvas.draw()
time.sleep(0.2)

The animation is getting slower because the old image isn't deleted. More and more images will be redrawn each time you call fig.canvas.draw(). Therefore add ax.cla() before the imshow call. The tutorial that Jake suggested doesn't need the cla because it sets the image directly, and will therefore be slightly faster.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Heat maps of large data with pyplot slow - python

pcolorfast works best for large arrays and updates quickly.

Related

How can I improve the look of scipy's spectrograms?

Making the marker size as small as possible with matplotlib.pyplot

How can I work around overflow error in matplotlib?

What are ways to speed up seaborns pairplot

Having issues with matplolib for loop getting slower

Categories

Resources