Dask visualize method image size too small - python

I am trying to use the visualize method to visualize a Dask graph. However, the resulting image is too small (because there are a lot of nodes in my graph). How can I increase its size?
Here is the code:
from dask.diagnostics import ProgressBar
from matplotlib import pyplot as plt
import dask.dataframe as dd
df = dd.read_csv('nyc_parking_tickets_2017.csv')
missing_values = df.isnull().sum()
missing_count = ((missing_values / df.index.size) * 100)
missing_count.visualize()
This code is taken from Data Science with Python and Dask by Jesse Daniel. The dataset comes from this Kaggle dataset on NYC parking tickets.

Dask uses relatively sane defaults for Graphviz, so it's a bit surprising that the image comes out small. If you want to modify the rendering of the graph itself, you can pass graph-level attributes to the visualize method (see its docstring); these are passed through to the Graphviz library.
You might also mean that the nodes in the graph are small, perhaps because there are very many of them. I don't recommend relying on the visualize method to gain insight if you have more than a few hundred partitions.
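For example (a sketch, assuming your Dask version forwards extra keyword arguments from visualize to Graphviz as graph attributes), you can set Graphviz sizing attributes directly:
# 'size' and 'rankdir' are standard Graphviz graph attributes; the values here are illustrative only.
missing_count.visualize(filename='missing_count.png',
                        rankdir='LR',    # lay the graph out left-to-right
                        size='20,20!')   # request a 20x20 inch canvas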

Related

Confusion with bandwidth on seaborn's kdeplot

lineslist, below, represents a set of lines (for some chemical spectrum, let's say), in MHz. I know the linewidth of the laser used to probe these lines to be 5 MHz. So, naively, the kernel density estimate of these lines with a bandwidth of 5 should give me the continuous distribution that would be produced in an experiment using the aforementioned laser.
The following code:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist = np.array([-153.3048645, -75.71982528, -12.1897835, -73.94903264,
                      -178.14293936, -123.51339541, -118.11826988, -50.19812838,
                      -43.69282206, -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r", bw=5)
plt.show()
yields a distribution that looks like a Gaussian with a bandwidth much larger than 5 MHz.
I'm guessing that for some reason, the bandwidth of the kdeplot has different units than the plot itself. The separation between the highest and lowest line is ~170.0 MHz. Supposing that I need to rescale the bandwidth by this factor:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist = np.array([-153.3048645, -75.71982528, -12.1897835, -73.94903264,
                      -178.14293936, -123.51339541, -118.11826988, -50.19812838,
                      -43.69282206, -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r",
            bw=5/(np.max(lineslist)-np.min(lineslist)))
plt.show()
I get a plot whose lines seem to have the expected 5 MHz bandwidth.
As dandy as that solution is, I've pulled it from my arse, and I'm curious whether someone more familiar with seaborn's kdeplot internals can comment on why this is.
Thanks,
Samuel
One thing to note is that Seaborn doesn't actually handle the bandwidth itself - it passes the setting on more or less as-is to either the SciPy or the Statsmodels package, depending on what you have installed. (It prefers Statsmodels, but will fall back to SciPy.)
The documentation for this parameter in the various sub-packages is a little confusing, but from what I can tell, the key issue here is that the setting for SciPy is a bandwidth factor, rather than a bandwidth itself. That is, this factor is (effectively) multiplied by the standard deviation of the data you're plotting to give you the actual bandwidth used in the kernels.
So with SciPy, if you have a fixed number which you want to use as your bandwidth, you need to divide it by the standard deviation of your data. And if you're trying to plot multiple datasets consistently, you need to adjust for the standard deviation of each dataset. This adjustment is effectively what you did by scaling by the range -- but again, the number actually used is not the range of the data, it's the standard deviation.
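For example, a quick sketch (assuming the SciPy backend, where bw acts as a multiplier on the sample standard deviation):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

lineslist = np.array([-153.3048645, -75.71982528, -12.1897835, -73.94903264,
                      -178.14293936, -123.51339541, -118.11826988, -50.19812838,
                      -43.69282206, -34.21268228])
desired_bw = 5.0  # MHz, the laser linewidth
# SciPy's gaussian_kde multiplies this factor by the sample standard deviation,
# so dividing by the std gives kernels roughly 5 MHz wide.
sns.kdeplot(lineslist, shade=True, color="r", bw=desired_bw / lineslist.std(ddof=1))
plt.show()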
To make things all the more confusing, Statsmodels expects the true bandwidth when given a scalar value, rather than a factor that's multiplied by the standard deviation of the sample. So depending on what backend you're using, Seaborn will behave differently. There's no direct way to tell Seaborn which backend to use - the best way to test is probably trying to import statsmodels, and seeing if that succeeds (takes bandwidth directly) or fails (takes bandwidth factor).
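Continuing the snippet above, a rough way to get consistent behaviour under either backend (again, this reflects the 0.7-era behaviour described here) is to branch on whether statsmodels can be imported:
try:
    import statsmodels.nonparametric.api  # noqa: F401 (Seaborn will use statsmodels)
    bw = desired_bw                          # statsmodels: bw is the bandwidth itself
except ImportError:
    bw = desired_bw / lineslist.std(ddof=1)  # SciPy: bw is a factor of the std
sns.kdeplot(lineslist, shade=True, color="r", bw=bw)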
By the way, these results were tested against Seaborn version 0.7.0 - I expect (hope?) that versions in the future might change this behavior.

What are ways to speed up seaborns pairplot

I have a dataframe with 250,000 rows and 140 columns, and I'm trying to construct a pair plot of the variables.
I know the number of subplots is huge, as is the time it takes to draw them (I've been waiting for more than an hour on an i5 at 3.4 GHz with 32 GB of RAM).
Remembering that scikit-learn allows random forests to be constructed in parallel, I checked whether the same is possible with seaborn.
However, I didn't find anything. The source code seems to call the matplotlib plot function for every single image.
Couldn't this be parallelised? If so, what would be a good starting point?
Rather than parallelizing, you could downsample your DataFrame to, say, 1000 rows to get a quick peek, if the speed bottleneck is indeed occurring there. 1000 points is usually enough to get a general idea of what's going on.
e.g. sns.pairplot(df.sample(1000)).
Save your pairplot to an image file and then display that image instead of rendering the whole figure in your browser.
from IPython.display import Image
import seaborn as sns
import matplotlib.pyplot as plt
sns_plot = sns.pairplot(df, size=2.0)
sns_plot.savefig("pairplot.png")
plt.clf() # Clear the pairplot figure created by seaborn
Image(filename='pairplot.png') # Show pairplot as image
In my case, the histograms were taking a very long time due to the variance in the data. I only had 1200 rows and 4 columns, but it took half an hour before I gave up. I think the data was so spread out and unordered that the histogram was constantly updating. One workaround might be to play with the bins parameter, but my solution was to use a KDE for the diagonal instead. With the KDE, it takes only a few seconds.
sns.pairplot(df, diag_kind='kde')
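Putting those suggestions together, a rough sketch (the dummy DataFrame, sample size, and column subset below are arbitrary stand-ins for your actual data and choices):
import numpy as np
import pandas as pd
import seaborn as sns

# Dummy stand-in for the 250,000 x 140 DataFrame (assumes numeric columns).
df = pd.DataFrame(np.random.randn(250000, 140),
                  columns=["col%d" % i for i in range(140)])

# Sample rows, restrict to a subset of columns, and use a KDE on the diagonal;
# each choice cuts down the work pairplot has to do.
grid = sns.pairplot(df.sample(1000), vars=df.columns[:8], diag_kind='kde')
grid.savefig("pairplot_preview.png")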

Smoothing HEALPix maps with `healpy`: Why does the output map appear "patchy"?

I have a HEALPix all-sky map from the AKARI Far Infrared Surveyor database (publicly released). I have tried to "smooth" the map using healpy, but the result looks very strange. Is there a better way? My question, however, relates to any all-sky HEALPix map (i.e. IRAS, Planck, WISE, WMAP).
My objective is to "smooth" the effective point-spread function of this AKARI map to an angular resolution of 1-degree (the original data has a PSF of about 1 arcminute). This is so that I can compare the far infrared AKARI map to lower resolution microwave maps (specifically, those of the anomalous microwave foreground).
In my example below, I'm using a degraded version of the map so that it is small enough to upload to GitHub. This means the pixels are about 3.42 arcminutes. Normally I wouldn't degrade the pixel scale this much before PSF smoothing, but this is just an example:
#Load the packages needed for visualization, and HEALPix processing
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import healpy as hp
import healpy.projector as pro
#Loads the HEALPix .FITS file into an array
map_in = hp.read_map("akari_WideL_1_1024.fits", nest = True)
#Visualizes the all-sky map, before any processing is done.
hp.mollview(map_in, title='AKARI All-Sky Map:', nest = True, norm = 'hist')
#Smooths the map with a 1-degree FWHM Gaussian (fwhm given in radians).
map_out = hp.sphtfunc.smoothing(map_in, fwhm = 0.017, iter = 1)
#Visualizes the map after smoothing
hp.mollview(map_out, title='AKARI All-Sky Map:', nest = True, norm = 'hist')
I have tried the healpy.sphtfunc.smoothing routine (https://healpy.readthedocs.org/en/latest/generated/healpy.sphtfunc.smoothing.html#healpy.sphtfunc.smoothing). As far as I understand, smoothing converts the map into spherical harmonics, convolves it with the Gaussian, and then converts it back into a spatial map.
I've saved the ipython notebook as well as the low-res .FITS HEALpix map in a github repository, here:
https://github.com/aaroncnb/healpy_smoothing_test
(You'll need to have the healpy package installed)
By running the code in the notebook, you can easily visualize the trouble I'm having: after smoothing the map, there are some strange "artifacts", as if the pixels had been iteratively box-averaged rather than smoothed with a circular Gaussian profile. What I expect to see is just a blurrier version of the input map.
I think I'm missing something fundamental about the conversion to spherical harmonics, before the smoothing is done.
Has anyone tried to do this kind of all-sky smoothing before, on a HEALPix map?
I believe another option is to convert the map to a standard, rectangular array, and then conduct the smoothing. However I remain curious about solving the problem without leaving the HEALPix format.
It appears smoothing works on a RINGed map only (it kind of makes sense to me, since this seems a bit easier to handle mathematically). Thus, you'll need to convert your input map to a RINGed format:
map_ring = hp.pixelfunc.reorder(map_in, inp='NEST', out='RING')
map_out = hp.sphtfunc.smoothing(map_ring, fwhm = 0.017, iter = 1)
hp.mollview(map_out, title='AKARI All-Sky Map:', nest = False, norm = 'hist')
This answer is from a bit of trial and error, because I can't find anything definitive on it in the documentation, and I haven't dived into the source code (though, with the below result, it may be easy to verify whether my assumption is correct by looking through the relevant source code).
Or, you may want to ask the healpix/healpy people directly.
(I'd suggest it is in fact a shortcoming in the documentation: the docs for healpy.sphtfunc.smoothing don't mention the required form for the input. I guess that's a healpy issue/PR for another day.)
Btw, bonus points for creating an SSCCE as a notebook file on GitHub! (Now if only StackOverflow also rendered notebooks.)
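As a side note, an equivalent sketch that skips the explicit reorder, relying on read_map's default of returning the map in RING ordering when nest=False:
#read_map with nest=False (the default) reorders a NESTED file to RING,
#so the result can be passed straight to smoothing; fwhm is in radians (~1 degree).
map_ring = hp.read_map("akari_WideL_1_1024.fits", nest = False)
map_out = hp.sphtfunc.smoothing(map_ring, fwhm = 0.017, iter = 1)
hp.mollview(map_out, title='AKARI All-Sky Map:', nest = False, norm = 'hist')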

Reduce the Size of Matplotlib Basemap

In various operational and manually-run Python scripts, I produce thousands of plots of California every day using matplotlib's Basemap:
from mpl_toolkits.basemap import Basemap
from matplotlib import cm, colors

mp = Basemap(width=1284000, height=1164000, projection='lcc', lat_1=30., lat_2=60,
             lat_0=37, lon_0=-120.5, resolution='c', rsphere=6370000.00)
# fill data: lats, lons, grid
cs = mp.pcolor(lons, lats, grid, cmap=cm.jet, norm=colors.LogNorm(), latlon=True)
mp.drawcounties()
However, I lose dozens of hours of computing time every single day to loading these large basemaps that cover the entire world. All I need is the counties in California, but I end up loading all the counties in the world into memory. I love how flexible this library is, but it is officially too slow.
How could I produce these maps faster?
Could I somehow create a cropped version of the matplotlib Basemap, with only California?
Is there a faster, non-matplotlib way to create geo-referenced plots?
How could I produce these maps faster?
You are already using the lowest resolution ("c"). You can change the "area_thresh" argument, but I don't think that'll help much. You'll probably have to work on the way you're plotting the data itself.
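For completeness, a quick sketch of the area_thresh tweak (the threshold value is arbitrary; area_thresh is in km² and drops coastline/lake features smaller than that):
from mpl_toolkits.basemap import Basemap

# Skip drawing small coastline/lake features; unlikely to be the main win here.
mp = Basemap(width=1284000, height=1164000, projection='lcc', lat_1=30., lat_2=60,
             lat_0=37, lon_0=-120.5, resolution='c', rsphere=6370000.00,
             area_thresh=10000.)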
Could I somehow create a cropped version of the matplotlib Basemap, with only California?
You can download a custom map of California and plot only that. Basemap comes with basic world maps, but you can use maps from other sources. It should be easy to find one by googling "california shapefile". Use the "readshapefile" method in mpl_toolkits.basemap to load it.
This notebook shows how to work with readshapefile and with an alternative method (the pyshp module).
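A minimal sketch of that approach (the 'ca_counties' path is a placeholder for whatever California county shapefile you download, given without the .shp extension):
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

# Same Lambert conformal map as in the question, but overlaying only a
# California county shapefile instead of calling mp.drawcounties().
mp = Basemap(width=1284000, height=1164000, projection='lcc', lat_1=30., lat_2=60,
             lat_0=37, lon_0=-120.5, resolution='c', rsphere=6370000.00)
mp.readshapefile('ca_counties', 'counties', drawbounds=True)
plt.savefig('california.png')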
Is there a faster, non-matplotlib way to create geo-referenced plots?
I've heard good things about Vincent, but I haven't tried it myself.

Matplotlib slow with large data sets, how to enable decimation?

I use matplotlib for a signal processing application and I noticed that it chokes on large data sets. This is something that I really need to improve to make it a usable application.
What I'm looking for is a way to let matplotlib decimate my data. Is there a setting, property, or other simple way to enable that? Any suggestions on how to implement this are welcome.
Some code:
import numpy as np
import matplotlib.pyplot as plt
n = 100000  # more than 100,000 points makes it unusably slow
plt.plot(np.random.random_sample(n))
plt.show()
Some background information
I used to work on a large C++ application where we needed to plot large datasets; to solve this problem, we took advantage of the structure of the data, as follows:
In most cases, if we want a line plot, the data is ordered and often even equidistant. If it is equidistant, you can calculate the start and end index in the data array directly from the zoom rectangle and the inverse axis transformation. If it is ordered but not equidistant, a binary search can be used.
Next, the zoomed slice is decimated. Because the data is ordered, we can simply iterate over the block of points that falls inside one pixel, and for each block the mean, maximum and minimum are calculated. Instead of a single pixel, we then draw a bar in the plot.
For example: if the x axis is ordered, a vertical line is drawn for each block, possibly with the mean marked in a different color.
To avoid aliasing, the plot is oversampled by a factor of two.
If it is a scatter plot, the data can be made ordered by sorting, because the sequence of plotting is not important.
The nice thing about this simple recipe is that the more you zoom in, the faster it becomes. In my experience, as long as the data fits in memory the plots stay very responsive. For instance, 20 plots of time-history data with 10 million points each should be no problem.
It seems like you just need to decimate the data before you plot it:
import numpy as np
import matplotlib.pyplot as plt

n = 100000  # more than 100,000 points makes it unusably slow
X = np.random.random_sample(n)
i = np.arange(0, n, 10)  # keep every 10th point
plt.plot(X[i])
plt.show()
Simple decimation is not always best; for example, if you decimate sparse data, it might all appear as zeros.
The decimation has to be smart, so that each horizontal screen pixel is plotted with the min and the max of the data between decimation points. Then, as you zoom in, you see more and more detail.
With zooming, this cannot easily be done outside matplotlib, so it is better handled internally.
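A rough sketch of that kind of min/max decimation (the block size of 1000 points per pixel column is an arbitrary choice, and zooming is ignored entirely here):
import numpy as np
import matplotlib.pyplot as plt

n = 1000000
x = np.random.random_sample(n)
block = 1000  # points collapsed into one plotted min/max pair
trimmed = x[:(n // block) * block].reshape(-1, block)
mins = trimmed.min(axis=1)
maxs = trimmed.max(axis=1)
# Interleave min and max so the plotted line spans each block's full range.
envelope = np.column_stack([mins, maxs]).ravel()
plt.plot(envelope)
plt.show()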
