Let's say we have a pandas dataframe pd and a dask dataframe dd. When I want to plot the pandas one with matplotlib, I can easily do it:
fig, ax = plt.subplots()
ax.bar(pd["series1"], pd["series2"])
fig.savefig(path)
However, when I try to do the same with the dask dataframe, I get type errors such as:
TypeError: Cannot interpret 'string[python]' as a data type
string[python] is just an example; whatever the dtype of dd["series1"] is will appear there.
So my question is: what is the proper way to use matplotlib with dask, and is it even a good idea to combine the two libraries?
One motivation to use dask instead of pandas is the size of the data. As such, simply swapping a pandas DataFrame for a dask DataFrame might not be feasible. Take a scatter plot: it might work well with 10K points, but if the dask dataframe has a billion rows, a plain matplotlib scatter is probably a bad idea (datashader is a more appropriate tool).
Some graphical representations are less sensitive to the size of the data, e.g. a normalized bar chart should work well, as long as the number of categories does not scale with the data. In this case the easiest solution is to use dask to compute the statistics of interest, then plot them from the resulting pandas object.
To summarise: I would consider the nature of the chart, figure out the best tool/representation, and if it's something that can/should be done with matplotlib, then I would run the computations on the dask DataFrame to get the reduced result as a pandas dataframe and proceed with matplotlib.
SultanOrazbayev's answer is still spot on; here is an answer elaborating on the datashader option (which hvPlot calls under the hood).
Don't use Matplotlib, use hvPlot!
If you wish to plot the data while it's still large, I recommend using hvPlot, as it can natively handle dask dataframes. It also automatically provides interactivity.
Example
import numpy as np
import dask
import hvplot.dask
# Create Dask DataFrame with normally distributed data
df = dask.datasets.timeseries()
df['x'] = df['x'].map_partitions(lambda x: np.random.randn(len(x)))
df['y'] = df['y'].map_partitions(lambda x: np.random.randn(len(x)))
# Plot
df.hvplot.scatter(x='x', y='y', rasterize=True)
Related
I have a fairly large pandas data frame (4000, 103), and for smaller dataframes I love using pairplot to visually see patterns in my data. But for my larger dataset the same command runs for over an hour with no output.
Is there an alternative tool to get the same outcome, or a way to speed up the command? I tried to use the sample option on pandas to reduce the dataset, but it still takes over an hour with no result.
dfSample = myData.sample(100) # make dataset smaller
sns.pairplot(dfSample, diag_kind="hist")
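You should sample columns, not just rows: pairplot draws one panel per pair of columns, so 103 columns means over 10,000 panels however few rows you keep. Replace your first line with
dfSample = myData.sample(10, axis=1)
And live happy.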
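Sampling columns and rows together could look roughly like this. This is a sketch on synthetic data with the same shape as in the question; the sample sizes (10 columns, 500 rows) and the col0..col102 names are arbitrary assumptions:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the shape from the question (4000 rows, 103 columns)
rng = np.random.default_rng(0)
myData = pd.DataFrame(
    rng.normal(size=(4000, 103)),
    columns=[f"col{i}" for i in range(103)],
)

# Sample 10 columns first, then 500 rows; both sizes are arbitrary choices
dfSample = (
    myData.sample(n=10, axis=1, random_state=0)
          .sample(n=500, random_state=0)
)

# pairplot draws one panel per ordered pair of columns
n_panels = dfSample.shape[1] ** 2
print(dfSample.shape, n_panels)

# With seaborn installed, the reduced frame can then be plotted as before:
# import seaborn as sns
# sns.pairplot(dfSample, diag_kind="hist")
```

If specific columns matter more than others, selecting them by name instead of sampling at random is probably the better choice.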
Hello dear Community,
I haven't found anything similar during my search and hope I haven't overlooked anything. I have the following issue:
I have a big dataset whose shape is 1352x121797 (1353 samples and 121797 time points). I have clustered these and would like to generate one plot per cluster, in which every time series of that cluster is plotted.
However, when using the matplotlib syntax it is extremely slow (and I'm not exactly sure where that comes from). Even after 5-10 minutes it hasn't finished.
import matplotlib.pyplot as plt
import pandas as pd
fig, ax = plt.subplots()
for index, values in subset_cluster.iterrows():  # One cluster subset, dataframe of shape (11x121797)
    ax.plot(values)
fig.savefig('test.png')
Even when inserting a break after ax.plot(values), it still doesn't finish. I'm using Spyder and thought that it might be due to Spyder always rendering the plot inline in the console.
However, when simply using the pandas method of the Series values.plot() instead of ax.plot(values) the plot appears and is saved in like 1-2 seconds.
As I need the customization options of matplotlib to standardize all the plots and make them look a little bit prettier, I would love to use the matplotlib syntax. Does anyone have any ideas?
Thanks in advance
Edit: after experimenting a little, it seems that rendering is the time-consuming part. When run with the backend matplotlib.use('Agg'), the plot command finishes more quickly (if using plt.plot() instead of ax.plot()), but plt.savefig() then takes forever. Still, it should finish in a reasonable amount of time, right? Even for 121xxx data points.
Posting as an answer as it may help OP or someone else: I had the same problem and found out that it was because the data I was using as the x-axis was of object dtype, while the y-axis data was float64. After explicitly converting the object column to datetime, plotting with Matplotlib went as fast as Pandas' df.plot(). I guess that Pandas does a better job of understanding the data type when plotting.
OP, you might want to check if the values you are plotting are in the right type, or if, like me, you had some problems when loading the dataframe from file.
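A small sketch of that dtype check and conversion, on made-up data (the column names time and value are hypothetical). As far as I understand, Matplotlib treats string x-values as categories, with one tick per distinct value, which would explain the slowdown, but take that as an assumption rather than a diagnosis of OP's case:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: timestamps that arrived as strings, e.g. after loading a CSV
df = pd.DataFrame({
    "time": pd.date_range("2021-01-01", periods=1000, freq="min")
              .strftime("%Y-%m-%d %H:%M:%S"),
    "value": np.linspace(0.0, 1.0, 1000),
})
dtype_before = df["time"].dtype  # a string/object dtype, not datetime64
print(dtype_before)

# Convert explicitly so Matplotlib gets real datetimes on the x-axis
df["time"] = pd.to_datetime(df["time"])

fig, ax = plt.subplots()
ax.plot(df["time"], df["value"])
```

Checking df.dtypes (or the dtype of the index or columns being used as x-values) before plotting is a cheap way to spot this kind of problem.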
I'm trying to get a Seaborn barplot containing the top n entries from a dataframe, sorted by one of the columns.
In Pandas, I'd typically do this using something like this:
df = df.sort_values('ColumnFoo', ascending=False)
sns.barplot(data=df[:10], x='ColumnFoo', y='ColumnBar')
Trying out Dask, though, there is (fairly obviously) no option to sort a dataframe, since dataframes are largely deferred objects, and sorting them would eliminate many of the benefits of using Dask in the first place.
Is there a way to either get ordered entries from a dataframe, or to have Seaborn pick the top n values from a dataframe's column?
If you're moving data to seaborn then it almost certainly fits in memory. I recommend just converting to a Pandas dataframe and then doing the sorting there.
Generally, once you've hit the small-data regime there is no reason to use Dask over Pandas. Pandas is more mature and a smoother experience. Dask Dataframe developers recommend using Pandas when feasible.
Is there any possibility to plot an APACHE Dataframe? I managed it by converting it to a Pandas dataframe, which takes a lot of time and is not my goal.
In particular, the goal is to plot a map from an Apache DataFrame without conversion to a Pandas DataFrame.
By plotting I mean using a library such as matplotlib or plotly to plot a graph or something similar.
Any ideas?
Thanks!
Do you mean plotting a Spark Dataframe?
In that case, you could do something like this, having yourDF as your Dataframe:
yourDF.show(100, truncate=False)
This will show in your logging your dataframe structure and values (in this case, first 100 rows) the same way you'll find in pandas. With the truncate option you specify you want to show the whole dataframe instead of a reduced version.
EDIT: in order to directly plot from a dataframe, please check the plotly lib, or the
display(dataframe)
function, documented here.
I'm learning how to plot things (CSV files) in Python, using import matplotlib.pyplot as plt.
Column1;Column2;Column3;
1;4;6;
2;2;6;
3;3;8;
4;1;1;
5;4;2;
I can plot the one above with plt.plotfile('test0.csv', (0, 1), delimiter=';'), which gives me the figure below.
Do you see the axis labels, column1 and column2? They are in lower case in the figure, but in the data file they begin with upper case.
Also, I tried plt.plotfile('test0.csv', ('Column1', 'Column2'), delimiter=';'), which does not work.
So it seems matplotlib.pyplot works only with lowercase names :(
Adding this issue to this other one, I guess it's time to try something else.
As I am pretty new to plotting in Python, I would like to ask: Where should I go from here, to get a little more than what matplotlib.pyplot provides?
Should I go to pandas?
You are mixing up two things here.
Matplotlib is designed for plotting data. It is not designed for managing data.
Pandas is designed for data analysis. Even if you were using pandas, you would still need to plot the data. How? Well, probably using matplotlib!
Independently of what you're doing, think of it as a three-step process:
1. Data acquisition, data read-in
2. Data processing
3. Data representation / plotting
plt.plotfile() is a convenience function, which you can use if you don't need step 2. at all. But it surely has its limitations.
Methods to read in data (not a complete list, of course) include pure Python open, Python's csv.reader or similar, numpy / scipy, pandas, etc.
Depending on what you want to do with your data, you can already choose a suitable input method: numpy for large numerical data sets, pandas for datasets which include qualitative data or rely heavily on cross-correlations, etc.
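As a rough sketch of the three steps with pandas for read-in and matplotlib for plotting, here is the sample file from the question, inlined as a string so the example is self-contained (the derived Sum column is purely illustrative and not part of the question):

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# The sample file from the question, inlined instead of read from test0.csv
csv_text = """Column1;Column2;Column3;
1;4;6;
2;2;6;
3;3;8;
4;1;1;
5;4;2;
"""

# 1. Read-in: the trailing ';' produces an extra empty column, dropped here
df = pd.read_csv(io.StringIO(csv_text), sep=";").dropna(axis=1, how="all")

# 2. Processing (optional): a derived column, purely for illustration
df["Sum"] = df["Column2"] + df["Column3"]

# 3. Plotting: labels are set explicitly, so the original case is preserved
fig, ax = plt.subplots()
ax.plot(df["Column1"], df["Column2"])
ax.set_xlabel("Column1")
ax.set_ylabel("Column2")
```

With a real file, pd.read_csv('test0.csv', sep=';') would replace the io.StringIO part; the rest stays the same.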