Interactive large plot with ~20 million sample points and gigabytes of data - python
I have got a problem (with my RAM) here: it's not able to hold the data I want to plot. I do have sufficient HD space. Is there any solution to avoid that "shadowing" of my data-set?
Concretely I deal with Digital Signal Processing and I have to use a high sample-rate. My framework (GNU Radio) saves the values (to avoid using too much disk space) in binary. I unpack it. Afterwards I need to plot. I need the plot zoomable, and interactive. And that is an issue.
Is there any optimization potential in this, or other software/programming languages (like R or so) which can handle larger data sets? Actually I want much more data in my plots. But I have no experience with other software. GNUplot fails with a similar approach to the following. I don't know R (yet).
import struct

import matplotlib.pyplot as plt

"""
plots a cfile

cfile - IEEE single-precision (4-byte) floats, IQ pairs, binary
txt - index,in-phase,quadrature in plaintext

note: directly plotting with numpy results in shadowed functions
"""

# unpacking the cfile dataset
def unpack_set(input_filename, output_filename):
    index = 0  # index of the samples
    output_file = open(output_filename, 'w')  # text mode, since we write plaintext rows

    with open(input_filename, "rb") as f:
        byte = f.read(4)  # read 1st column of the vector
        while byte != b"":
            # stored bit values; struct.unpack returns a tuple, so take item [0]
            floati = struct.unpack('f', byte)[0]  # value of the 1st column
            byte = f.read(4)  # read 2nd column of the vector
            floatq = struct.unpack('f', byte)[0]  # value of the 2nd column
            byte = f.read(4)  # next row of the vector, read 1st column again

            # delimiter format for matplotlib
            lines = ["%d," % index, format_value(floati), ",", format_value(floatq), "\n"]
            output_file.writelines(lines)
            index += 1

    output_file.close()  # note the parentheses: `output_file.close` alone does nothing
    return output_filename

# reformats output (precision configuration here); renamed so it does not
# shadow the built-in format()
def format_value(value):
    return "%.8f" % value

# start
def main():
    # specify path
    unpacked_file = unpack_set("test01.cfile", "test01.txt")
    # pass the file name to matplotlib
    plt.plotfile(unpacked_file, cols=(0, 1))  # index vs. in-phase

    # optional
    # plt.axes([0, 0.5, 0, 100000])  # for 100k samples

    plt.grid(True)
    plt.title("Signal-Diagram")
    plt.xlabel("Sample")
    plt.ylabel("In-Phase")
    plt.show()

if __name__ == "__main__":
    main()
Something like plt.swap_on_disk() could cache the stuff on my SSD ;)
So your data isn't that big, and the fact that you're having trouble plotting it points to issues with the tools. Matplotlib has lots of options and the output is fine, but it's a huge memory hog and it fundamentally assumes your data is small. But there are other options out there.
So as an example, I generated a 20M data-point file 'bigdata.bin' using the following:
#!/usr/bin/env python
import numpy

npts = 20000000
filename = 'bigdata.bin'

def main():
    data = (numpy.random.uniform(0, 1, (npts, 3))).astype(numpy.float32)
    data[:, 2] = 0.1 * data[:, 2] + numpy.exp(-((data[:, 1] - 0.5)**2.) / (0.25**2))

    # scipy.io.numpyio.fwrite has long been removed from SciPy;
    # ndarray.tofile writes the same raw float32 bytes
    data.tofile(filename)

if __name__ == "__main__":
    main()
This generates a file of size ~229MB, which isn't all that big; but you've expressed that you'd like to go to even larger files, so you'll hit memory limits eventually.
Let's concentrate on non-interactive plots first. The first thing to realize is that vector plots with glyphs at each point are going to be a disaster: for each of the 20M points, most of which are going to overlap anyway, trying to render little crosses or circles generates huge files and takes tonnes of time. This, I think, is what is sinking matplotlib by default.
Gnuplot has no trouble dealing with this:
gnuplot> set term png
gnuplot> set output 'foo.png'
gnuplot> plot 'bigdata.bin' binary format="%3float32" using 2:3 with dots
And even Matplotlib can be made to behave with some caution (choosing a raster back end, and using pixels to mark points):
#!/usr/bin/env python
import numpy
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

datatype = [('index', numpy.float32), ('floati', numpy.float32),
            ('floatq', numpy.float32)]
filename = 'bigdata.bin'

def main():
    data = numpy.memmap(filename, datatype, 'r')
    plt.plot(data['floati'], data['floatq'], 'r,')
    plt.grid(True)
    plt.title("Signal-Diagram")
    plt.xlabel("Sample")
    plt.ylabel("In-Phase")
    plt.savefig('foo2.png')

if __name__ == "__main__":
    main()
Now, if you want interactive, you're going to have to bin the data to plot, and zoom in on the fly. I don't know of any python tools that will help you do this offhand.
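To illustrate the binning idea, though, here is a minimal hand-rolled sketch of re-decimating on zoom with matplotlib callbacks. It is only an assumption-laden example: it presumes the first column is a monotonically increasing sample index (true of the question's .cfile dump, though not of the random test file generated above), and max_pts is an arbitrary point budget, not a library feature:
import numpy
import matplotlib.pyplot as plt

datatype = [('index', numpy.float32), ('floati', numpy.float32),
            ('floatq', numpy.float32)]
data = numpy.memmap('bigdata.bin', datatype, 'r')
max_pts = 100000  # arbitrary budget: never draw more points than this

fig, ax = plt.subplots()
step = max(len(data) // max_pts, 1)
line, = ax.plot(data['index'][::step], data['floati'][::step], 'r,')

def on_xlim_changed(ax):
    # re-decimate to the current view every time the user zooms or pans
    lo, hi = ax.get_xlim()
    i0, i1 = numpy.searchsorted(data['index'], [lo, hi])
    step = max((i1 - i0) // max_pts, 1)
    line.set_data(data['index'][i0:i1:step], data['floati'][i0:i1:step])
    ax.figure.canvas.draw_idle()

ax.callbacks.connect('xlim_changed', on_xlim_changed)
plt.show()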
On the other hand, plotting-big-data is a pretty common task, and there are tools that are up for the job. Paraview is my personal favourite, and VisIt is another one. They both are mainly for 3D data, but Paraview in particular does 2D as well, and is very interactive (and even has a Python scripting interface). The only trick will be to write the data into a file format that Paraview can easily read.
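Since Paraview reads plain CSV natively, one simple (if slow) conversion sketch for the binary dump above could look like the following. This is my own illustration using only NumPy, not something from the original answer:
import numpy

datatype = [('index', numpy.float32), ('floati', numpy.float32),
            ('floatq', numpy.float32)]
data = numpy.memmap('bigdata.bin', datatype, 'r')

# numpy.savetxt is slow on 20M rows, but needs nothing beyond numpy
numpy.savetxt('bigdata.csv',
              numpy.column_stack([data['index'], data['floati'], data['floatq']]),
              fmt='%.8f', delimiter=',',
              header='index,floati,floatq', comments='')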
A survey of open source interactive plotting software with a 10 million point scatter plot benchmark on Ubuntu
Inspired by the use case described at https://stats.stackexchange.com/questions/376361/how-to-find-the-sample-points-that-have-statistically-meaningful-large-outlier-r, I have benchmarked a few plotting programs with the exact same input files.
Basically, I wanted to:
do an XY scatter plot of multidimensional data, hopefully with Z as the point color
interactively select some interesting looking points from the plot with my mouse
view all dimensions of the selected points (including at least X, Y and Z) to try and understand why they are outliers in the XY scatter
That problem can be represented by the following simplified test data:
i=0;
while [ "$i" -lt 10000000 ]; do
echo "$i,$((2 * i)),$((4 * i))"; i=$((i + 1));
done > 10m1.csv
echo 5000000,20000000,-1 >> 10m1.csv
The first few lines of 10m1.csv look like this:
0,0,0
1,2,4
2,4,8
3,6,12
4,8,16
and the very last line, the 10,000,001st, is the outlier, and looks like:
5000000,20000000,-1
so we basically have:
a line with slope 2 and 10 million points on it
plus a single outlier point outside of the line, at the top center of the plot
something like:
Y
^
|
|
| + +
|
| +
|
| +
|
| +
|
| +
|
| +
|
| +
|
| +
|
+-------------------> X
And the goal of this benchmark is to find the point (5000000,20000000) on the graphical plot and then determine the value of the third column from it, which is -1 in our test.
When I first wrote this answer, I had used 10m.csv, generated with:
i=0;
while [ "$i" -lt 10000000 ]; do
echo "$i,$((2 * i)),$((4 * i))"; i=$((i + 1));
done > 10m.csv
without the outlier. While this tests performance, it does not test selection capabilities, so the goal is to migrate each test to 10m1.csv when I find motivation to do it.
I also made a 10 point + outlier example in case I wanted to evaluate usability for some tool that could not handle the 10m point count:
i=0;
while [ "$i" -lt 10 ]; do
echo "$i,$((2 * i)),$((4 * i))"; i=$((i + 1));
done > 11.csv
echo 5,20,-1 >> 11.csv
To have extra fun, I also prepared an even larger 1 billion point dataset in case any of the programs could handle the 10 million points! CSV files were getting a bit wonky, so I moved to HDF5:
#!/usr/bin/env python3
import h5py
import numpy
size = 1000000000
with h5py.File('1b.hdf5', 'w') as f:
    x = numpy.arange(size + 1)
    x[size] = size / 2
    f.create_dataset('x', data=x, dtype='int64')

    y = numpy.arange(size + 1) * 2
    y[size] = 3 * size / 2
    f.create_dataset('y', data=y, dtype='int64')

    z = numpy.arange(size + 1) * 4
    z[size] = -1
    f.create_dataset('z', data=z, dtype='int64')
This produces a ~23GiB file analogous to 10m1.csv containing:
1 billion points in a straight line much like 10m.csv
one outlier point at the center top of the graph
The tests were carried out on Ubuntu 18.10 unless mentioned otherwise in a given subsection, on a ThinkPad P51 laptop with an Intel Core i7-7820HQ CPU (4 cores / 8 threads), 2x Samsung M471A2K43BB1-CRC RAM (2x 16GiB) and an NVIDIA Quadro M1200 4GB GDDR5 GPU.
Summary of results
This is what I observed, considering my very specific test use case and that I'm a first-time user of much of the reviewed software:
Does it handle 10 million points:
Vaex Yes, tested up to 1 Billion!
VisIt Yes, but not 100m
Paraview No
Mayavi Yes
gnuplot Barely, in non-interactive mode.
matplotlib No
Bokeh No, up to 1m
PyViz ?
seaborn ?
Does it have a lot of features:
Vaex Yes.
VisIt Yes, 2D and 3D, focus on interactive.
Paraview Same as above, maybe a bit fewer 2D features.
Mayavi 3D only, good interactive and scripting support, but more limited features.
gnuplot Lots of features, but limited in interactive mode.
matplotlib Same as above.
Bokeh Yes, easy to script.
PyViz ?
seaborn ?
Does the GUI feel good (not considering performance):
Vaex Yes, Jupyter widget
VisIt No
Paraview Very
Mayavi OK
gnuplot OK
matplotlib OK
Bokeh Very, Jupyter widget
PyViz ?
seaborn ?
Vaex 2.0.2
https://github.com/vaexio/vaex
Install and get a hello world working as shown at: How to do interactive 2D scatter plot zoom / point selection in Vaex?
I tested vaex with up to 1 billion points and it worked, it is awesome!
It is "Python-scripted-first" which is great for reproducibility, and allows me to easily interface with other Python things.
The Jupyter setup has a few moving parts, but once I got it running with virtualenv, it was amazing.
To load our CSV, run in Jupyter:
import vaex
df = vaex.from_csv('10m.csv', names=['x', 'y', 'z'],)
df.plot_widget(df.x, df.y, backend='bqplot')
and we can see instantly:
Now, we can zoom, pan and select points with the mouse, and updates are really fast, all in under 10 seconds. Here I have zoomed in to see some individual points and have selected a few of them (faint lighter rectangle on image):
After the selection is made with the mouse, this has the exact same effect as using the df.select() method. So we can extract the selected points by running in Jupyter:
df.to_pandas_df(selection=True)
which outputs data with format:
x y z index
0 4525460 9050920 18101840 4525460
1 4525461 9050922 18101844 4525461
2 4525462 9050924 18101848 4525462
3 4525463 9050926 18101852 4525463
4 4525464 9050928 18101856 4525464
5 4525465 9050930 18101860 4525465
6 4525466 9050932 18101864 4525466
Since 10M points worked fine, I decided to try 1B points... and it also worked fine!
import vaex
df = vaex.open('1b.hdf5')
df.plot_widget(df.x, df.y, backend='bqplot')
To observe the outlier, which was invisible on the original plot, we can follow How change the point style in a vaex interactive Jupyter bqplot plot_widget to make individual points larger and visible? and use:
df.plot_widget(df.x, df.y, f='log', shape=128, backend='bqplot')
which produces:
and after selecting the point:
we obtain the outlier's full data:
x y z
0 500000000 1500000000 -1
Here is a demo by the creators with a more interesting dataset and more features: https://www.youtube.com/watch?v=2Tt0i823-ec&t=770
Tested in Ubuntu 19.04.
VisIt 2.13.3
Website: https://wci.llnl.gov/simulation/computer-codes/visit
License: BSD
Developed by Lawrence Livermore National Laboratory, which is a National Nuclear Security Administration laboratory, so you can imagine that 10m points will be nothing for it if I could get it working. (The book The Supermen: The Story of Seymour Cray by Charles J. Murray (1997) does a good job of showing how computationally hungry labs such as these were when building the first H-bombs, because you can't just run experiments at will with nukes, and even if you do, you can't really measure what you would like because it blows up too fast and too hot: a computer model is a must. And they decided that a bunch of physicists' wives with calculators wasn't going to cut it as it had for the earlier Los Alamos fission bomb. When Israel bought one of their computers, everyone immediately assumed it was to make nukes.)
Installation: there is no Debian package, just download Linux binaries from website. Runs without installing. See also: https://askubuntu.com/questions/966901/installing-visit
Based on VTK, which is the backend library that much high-performance graphing software uses. Written in C++.
After 3 hours of playing with the UI, I did get it working, and it did solve my use case as detailed at: https://stats.stackexchange.com/questions/376361/how-to-find-the-sample-points-that-have-statistically-meaningful-large-outlier-r
Here is how it looks on the test data of this post:
and a zoom with some picks:
and here is the picks window:
Performance-wise, VisIt was very good: every graphic operation either took only a small amount of time or was immediate. When I had to wait, it showed a "processing" message with the percentage of work left, and the GUI didn't freeze.
Since 10m points worked so well, I also tried 100m points (a 2.7G CSV file), but unfortunately it crashed / went into a weird state; I watched it in htop as the 4 VisIt threads took up all of my 16GiB RAM and it died, likely due to a failed malloc.
The initial getting started was a bit painful:
many of the defaults feel atrocious if you are not a nuclear bomb engineer, e.g.:
default point size 1px (gets confused with dust on my monitor)
axes scale from 0.0 to 1.0: How to show the actual axes number values on the Visit plotting program instead of fractions from 0.0 to 1.0?
multi-window setup, nasty multi popups when you Pick data points
shows your username and plot date (remove with "Controls" > "Annotation" > "User information")
automatic positioning defaults are bad: the legend conflicts with the axes, and I could not find title automation, so I had to add a label and reposition everything by hand
there are just a lot of features, so it can be hard to find what you want
the manual was very helpful, but it is a 386-page PDF mammoth, ominously dated "October 2005 Version 1.5". I wonder if they used this to develop Trinity! (It has since been replaced by a nice Sphinx HTML version, created just after I originally answered this question.)
no Ubuntu package. But the prebuilt binaries did just work.
I attribute these problems to:
it has been around for such a long time and uses some outdated GUI ideas
you can't just click on the plot elements to change them (e.g. axes, title, etc.), and there are a lot of features, so it is a bit hard to find the one you are looking for
I also love how a bit of LLNL infrastructure leaks into that repo. See for example docs/OfficeHours.txt and other files in that directory! I'm sorry for Brad, who is the "Monday Morning guy"! Oh, and the password for the answering machine is "Kill Ed", don't forget that.
Paraview 5.9.0
Website: https://www.paraview.org/
License: BSD
Tested on: Ubuntu 20.10.
Installation:
sudo apt install paraview
or get the latest by downloading prebuilt binaries from the website. This is what I did for this review, since the apt one was only at 5.7.0. I downloaded ParaView-5.9.0-MPI-Linux-Python3.8-64bit.tar.gz.
Developed by Kitware and Los Alamos National Laboratory, and later Sandia National Laboratories (so the other two NNSA labs), so once again we expect that it will easily handle the data. Also VTK based and written in C++, which was further promising.
However I was disappointed: for some reason, 10m points made the GUI very slow and unresponsive, making it unusable. Whenever I clicked something, like to hide the lines, it took several dozen seconds. I think that at some point it just glitched out and stopped responding at all.
I'm fine with a controlled well advertised "I'm working now, wait a bit" moment, but the GUI freezing while that happens? Not acceptable.
htop showed that Paraview was using 8 threads and 3GB RAM, so neither CPU nor memory was maxed out.
GUI-wise, Paraview is very nice and modern, way better than VisIt when it is not stuttering.
Since 10m1.csv killed it, I tested with 11.csv to see if I would have been able to solve my problem except for performance, and the answer is yes:
paraview 11.csv
select CSV reader from the popup
Properties > Apply on the left
right click on the CSV on Pipeline Browser
Add filter > Alphabetical > Plot Data. Why is plotting a filter? Not very intuitive for first-time users; related: paraview: plot data from csv file. I'm sure it is one of those things that make sense once you understand further generalizations of what filters can do, but still.
properties > Apply
unselect "Use index for x axis"
X Array Name: Field 0
Series Parameters: remove Field 0 and Field 2
Select Field 1 and:
Line style: None
Marker style: cross
Marker size: increase or decrease as needed
"Rectangle Selection (s)" icon above the plot
Select outlier (point is highlighted)
Add another filter to the plot filter: "Extract Selection"
Apply
And finally!!! I get a table containing only the selected outlier, and showing the value of "Field 2" as -1:
So yes, not exactly a walk in the park, but I managed eventually.
Another downside is that Paraview felt lacking in features compared to VisIt, e.g.:
I could not find how to set the color of my scatter based on a third column: How to color scatter plot points by the value of a third column in Paraview like gnuplot palette?
Mayavi 4.6.2
Website: https://github.com/enthought/mayavi
Developed by: Enthought
Install:
sudo apt-get install libvtk6-dev
python3 -m pip install -U mayavi PyQt5
This is the Python-native VTK option.
Mayavi seems to be very focused on 3D; I could not find how to do 2D plots in it, so unfortunately it does not cut it for my use case.
Just to check performance, however, I adapted the example from https://docs.enthought.com/mayavi/mayavi/auto/example_scatter_plot.html for 10 million points, and it ran just fine without lagging:
import numpy as np
from tvtk.api import tvtk
from mayavi.scripts import mayavi2

n = 10000000
pd = tvtk.PolyData()
pd.points = np.linspace((1,1,1), (n,n,n), n)
pd.verts = np.arange(n).reshape((-1, 1))
pd.point_data.scalars = np.arange(n)

@mayavi2.standalone
def main():
    from mayavi.sources.vtk_data_source import VTKDataSource
    from mayavi.modules.outline import Outline
    from mayavi.modules.surface import Surface

    mayavi.new_scene()
    d = VTKDataSource()
    d.data = pd
    mayavi.add_source(d)
    mayavi.add_module(Outline())
    s = Surface()
    mayavi.add_module(s)
    s.actor.property.trait_set(representation='p', point_size=1)

main()
Output:
However, I couldn't zoom in enough to see individual points; the near 3D plane was too far away. Maybe there is a way?
One cool thing about Mayavi is that the devs put a lot of effort into allowing you to fire up and set up the GUI from a Python script nicely, much like Matplotlib and gnuplot. It seems that this is also possible in Paraview, but the docs are not as good, at least.
Generally, it does not feel as featureful as VisIt / Paraview. For example, I couldn't directly load a CSV from the GUI: How to load a CSV file from the Mayavi GUI?
Gnuplot 5.2.2
Website: http://www.gnuplot.info/
gnuplot is really convenient when I need to go quick and dirty, and it is always the first thing that I try.
Installation:
sudo apt-get install gnuplot
For non-interactive use, it can handle 10m points reasonably well:
#!/usr/bin/env gnuplot
set terminal png size 1024,1024
set output "gnuplot.png"
set key off
set datafile separator ","
plot "10m1.csv" using 1:2:3:3 with labels point
which finished in 7 seconds:
But if I try to go interactive with
#!/usr/bin/env gnuplot
set terminal wxt size 1024,1024
set key off
set datafile separator ","
plot "10m.csv" using 1:2:3 palette
and:
gnuplot -persist main.gnuplot
then the initial render and zooms feel too sluggish. I can't even see the rectangle selection line!
Also note that for my use case, I needed to use hypertext labels as in:
plot "10m.csv" using 1:2:3 with labels hypertext
but there was a performance bug with the labels feature including for non-interactive rendering. But I reported it, and Ethan solved it in a day: https://groups.google.com/forum/#!topic/comp.graphics.apps.gnuplot/qpL8aJIi9ZE
I must say however that there is one reasonable workaround for outlier selection: just add labels with the row ID to all points! If there are many points nearby, you won't be able to read the labels. But for the outliers which you care about, you just might! For example, if I add one outlier to our original data:
cp 10m.csv 10m1.csv
printf '2500000,10000000,40000000\n' >> 10m1.csv
and modify the plot command to:
#!/usr/bin/env gnuplot
set terminal png size 1024,1024
set output "gnuplot.png"
set key off
set datafile separator ","
plot "10.csv" using 1:2:3:3 palette with labels
This slowed down the plotting significantly (40 minutes, even after the fix mentioned above!!!), but produced a reasonable output:
so with some data filtering, we would get there, eventually.
Matplotlib 1.5.1, numpy 1.11.1, Python 3.6.7
Website: https://matplotlib.org/
Matplotlib is what I usually try when my gnuplot script starts getting too insane.
numpy.loadtxt alone took about 10 seconds, so I knew this wasn't going to go well:
#!/usr/bin/env python3
import numpy
import matplotlib.pyplot as plt
x, y, z = numpy.loadtxt('10m.csv', delimiter=',', unpack=True)
plt.figure(figsize=(8, 8), dpi=128)
plt.scatter(x, y, c=z)
# Non-interactive.
#plt.savefig('matplotlib.png')
# Interactive.
plt.show()
First the non-interactive attempt gave good output, but took 3 minutes and 55 seconds...
Then the interactive one took a long time on initial render and on zooms. Not usable:
Notice on this screenshot how the zoom selection, which should immediately zoom and disappear, stayed on screen for a long time while it waited for the zoom to be calculated!
I had to comment out plt.figure(figsize=(8, 8), dpi=128) for the interactive version to work for some reason, or else it blew up with:
RuntimeError: In set_size: Could not set the fontsize
Bokeh 1.3.1
https://github.com/bokeh/bokeh
Ubuntu 19.04 install:
python3 -m pip install bokeh
Then launch Jupyter:
jupyter notebook
Now if I plot 1m points, everything works perfectly; the interface is awesome and fast, including zoom and on-hover information:
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure
from bokeh.transform import linear_cmap
import numpy as np

output_notebook()  # route show() to the notebook

N = 1000000
source = ColumnDataSource(data=dict(
    x=np.random.random(size=N) * N,
    y=np.random.random(size=N) * N,
    z=np.random.random(size=N)
))
hover = HoverTool(tooltips=[("z", "@z")])
p = figure()
p.add_tools(hover)
p.circle(
    'x',
    'y',
    source=source,
    color=linear_cmap('z', 'Viridis256', 0, 1.0),
    size=5
)
show(p)
Initial view:
After a zoom:
If I go up to 10m though, it chokes; htop shows that chromium has 8 threads taking up all my memory in uninterruptible IO state.
This question asks about referencing the selected points: How to reference selected bokeh data points
PyViz
https://pyviz.org/
TODO evaluate.
Integrates Bokeh + datashader + other tools.
Video demoing 1B datapoints: https://www.youtube.com/watch?v=k27MJJLJNT4 "PyViz: Dashboards for Visualizing 1 Billion Datapoints in 30 Lines of Python" by "Anaconda, Inc." published on 2018-04-17.
seaborn
https://seaborn.pydata.org/
TODO evaluate.
There's already a QA on how to use seaborn to visualize at least 50 million rows.
A more recent project has strong potential for large data sets: Bokeh, which was created with exactly this in mind.
In fact, only the data that's relevant at the scale of the plot is sent to the display backend. This approach is much faster than the Matplotlib approach.
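As one way to see this aggregate-before-render idea in practice today, here is a minimal sketch using datashader from the same ecosystem. This is my own example, not from the original answer; the column names, sizes, and output file name are made up:
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# 20M points rasterized into a fixed-size image; only the image is displayed
n = 20000000
df = pd.DataFrame({'x': np.arange(n),
                   'y': np.random.randn(n).cumsum()})
canvas = ds.Canvas(plot_width=800, plot_height=400)
agg = canvas.points(df, 'x', 'y')  # per-pixel point counts
tf.shade(agg).to_pil().save('datashader.png')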
You can certainly optimize the reading of your file: you could directly read it into a NumPy array, so as to leverage the raw speed of NumPy. You have a few options. If RAM is an issue, you can use memmap, which keeps most of the file on disk (instead of in RAM):
import numpy as np

# Each data point is a sequence of three 32-bit floats:
data = np.memmap(filename, mode='r', dtype=[('index', 'float32'), ('floati','float32'), ('floatq', 'float32')])
If RAM is not an issue, you can put the whole array in RAM with fromfile:
data = np.fromfile(filename, dtype=[('index', 'float32'), ('floati','float32'), ('floatq', 'float32')])
Plotting can then be done with Matplotlib's usual plot() function on the named fields, possibly through the "zoom in" method proposed in another solution.
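For instance, a short sketch, with the field names coming from the dtype defined above:
import matplotlib.pyplot as plt

plt.plot(data['index'], data['floati'])
plt.show()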
I would suggest something a bit complex, but that should work: build your graph at different resolutions, for different ranges.
Think of Google Earth, for example. If you zoom out to the maximum level to cover the whole planet, the resolution is the lowest. When you zoom in, the pictures change to more detailed ones, but only for the region you're zooming in on.
So basically, for your plot (is it 2D? 3D? I'll assume it's 2D), I suggest you build one big graph that covers the whole [0, n] range with low resolution, 2 smaller graphs that cover [0, n/2] and [n/2 + 1, n] with twice the resolution of the big one, 4 smaller graphs that cover [0, n/4] ... [3 * n / 4 + 1, n] with twice the resolution of the 2 above, and so on.
Not sure my explanation is really clear. Also, I don't know if this kind of multi-resolution graph is handled by any existing plot program.
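For what it's worth, here is a rough NumPy sketch of this pyramid idea (my own illustration, not an existing tool; build_pyramid, window, and base are made-up names, with base an arbitrary point budget):
import numpy as np

def build_pyramid(y, base=100000):
    # level k holds every 2**k-th sample, so each level covers the whole
    # range at half the resolution of the previous one
    levels = [np.asarray(y)]
    while len(levels[-1]) > base:
        levels.append(levels[-1][::2])
    return levels

def window(levels, lo, hi, base=100000):
    # pick the finest level that keeps the window [lo, hi) under base points
    k = 0
    while k < len(levels) - 1 and (hi - lo) // 2**k > base:
        k += 1
    i0, i1 = lo // 2**k, hi // 2**k
    return np.arange(i0, i1) * 2**k, levels[k][i0:i1]

# usage sketch: full view first, then a zoomed window at finer resolution
y = np.random.randn(20000000).cumsum()
levels = build_pyramid(y)
x_full, y_full = window(levels, 0, len(y))
x_zoom, y_zoom = window(levels, 1000000, 1100000)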
I wonder if there's a win to be had by speeding up the lookup of your points? (I've been intrigued by R* (r-star) trees for a while.)
I wonder if using something like an R*-tree could be the way to go here: when zoomed out, nodes higher up in the tree could contain information about the coarser, zoomed-out rendering, while nodes further towards the leaves contain the individual samples.
Maybe you could even memory-map the tree (or whatever structure you end up using) to keep your performance up and your RAM usage low (you offload the task of memory management to the kernel).
Hope that makes sense.. rambling a bit. It's late!
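As a tiny concrete sketch of the "coarse summaries up top, raw samples at the leaves" idea, using a flat block summary rather than a real R*-tree (all names here are made up for illustration):
import numpy as np

def blocks_minmax(y, block=1024):
    # one (min, max) pair per block of samples; drawn as vertical bars,
    # these give a faithful zoomed-out envelope without touching every
    # sample at render time (audio editors do something similar)
    nb = len(y) // block
    b = np.asarray(y[:nb * block]).reshape(nb, block)
    return b.min(axis=1), b.max(axis=1)

y = np.random.randn(20000000).astype(np.float32)
lo, hi = blocks_minmax(y)  # ~20k (min, max) pairs instead of 20M points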
You don't fully specify what you are looking for, but here is a quick cut and paste-able example that shows some of the plotting abilities of matplotlib. The example also saves the image as a png and pdf (rasterized and vectorized respectively): import numpy as np import pylab as plt # Create some sample "time-series" data N = 200 T = np.linspace(0, 5, N) Y1 = T**2 - T*np.cos(T*5) + np.random.random(N) Y2 = T**2 - T*np.sin(T*5) + np.random.random(N) pargs = {"lw":5, "alpha":.6} plt.plot(T,Y1, 'r',**pargs) plt.plot(T,Y2, 'b', **pargs) skip = 30 plt.scatter(T[::skip],Y2[::skip], color='k', s=200, alpha=.6) plt.xlabel("time") plt.ylabel("money") plt.axis('tight') # Save as a png and a pdf plt.savefig("example.png") plt.savefig("example.pdf") plt.show()