Why is matplotlib plotting so much slower than pd.DataFrame.plot()? - python

Hello dear Community,
I haven't found anything similar during my search and hope I haven't overlooked anything. I have the following issue:
I have a big dataset whose shape is 1352x121797 (1352 samples and 121797 time points). I have clustered these samples and would like to generate one plot per cluster in which every time series belonging to that cluster is plotted.
However, when using the matplotlib syntax it is extremely slow (and I'm not exactly sure why). Even after 5-10 minutes it hasn't finished.
import matplotlib.pyplot as plt
import pandas as pd
fig, ax = plt.subplots()
for index, values in subset_cluster.iterrows():  # one cluster subset, DataFrame of shape (11, 121797)
    ax.plot(values)
fig.savefig('test.png')
Even when inserting a break after ax.plot(values), it still doesn't finish. I'm using Spyder and thought it might be because Spyder always renders the plot inline in the console.
However, when simply using the pandas method values.plot() on the Series instead of ax.plot(values), the plot appears and is saved within 1-2 seconds.
As I need matplotlib's customization options for standardizing all the plots and making them look a little prettier, I would love to use the matplotlib syntax. Does anyone have any ideas?
Thanks in advance
Edit: while trying things out a little, it seems that the rendering is the time-consuming part. When run with the backend matplotlib.use('Agg'), the plot command runs through quicker (if using plt.plot() instead of ax.plot()), but plt.savefig() then takes forever. Still, it should finish in a reasonable amount of time, right? Even for 121xxx data points.

Posting as an answer as it may help the OP or someone else: I had the same problem and found out that it was because the data I was using as the x-axis was an Object, while the y-axis data was float64. After explicitly converting the object to DateTime, plotting with Matplotlib went as fast as Pandas' df.plot(). I guess that Pandas does a better job at understanding the data type when plotting.
OP, you might want to check whether the values you are plotting have the right dtype, or whether, like me, you had some problems when loading the dataframe from file. A minimal sketch of that check is below.
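For example, a minimal sketch of that check, assuming a hypothetical timestamp column that was read in as strings (the column and file names are only illustrative):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')  # hypothetical file name
print(df.dtypes)  # look for unexpected 'object' dtypes on columns you plot

# If the x values came in as object/str, convert them explicitly before plotting
df['timestamp'] = pd.to_datetime(df['timestamp'])

fig, ax = plt.subplots()
ax.plot(df['timestamp'], df['value'])  # 'value' is likewise a placeholder name
fig.savefig('test.png')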

Related

Fail to allocate bitmap error on numeric data using pandas profiling

I am doing exploratory data analysis on my numeric data and I tried to run pandas profiling, but I got an error while generating the report structure.
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_csv('mydatadata.csv')
print(df)
profile = ProfileReport(df)
profile.to_file(output_file="mydata.html")
and the error log looks like this
Summarize dataset:  99%|███████████████████████████████████████████████████████████████████████▌| 1144/1150 [46:07<24:03, 240.60s/it, Calculate cramers correlation]
C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas_profiling\model\correlations.py:101: UserWarning: There was an attempt to calculate the cramers correlation, but this failed. To hide this warning, disable the calculation (using df.profile_report(correlations={"cramers": {"calculate": False}}) If this is problematic for your use case, please report this as an issue: https://github.com/pandas-profiling/pandas-profiling/issues (include the error message: 'No data; observed has size 0.')
  warnings.warn(
Summarize dataset: 100%|██████████████████████████████████████████████████████████████████████████████████▋| 1145/1150 [46:19<17:32, 210.49s/it, Get scatter matrix]
Fail to allocate bitmap
Reason your code may have failed
If your code failed for the same reason as mine, you either tried making multiple profiles at the same time, or tried making a profile with a large dataset (in terms of number of variables).
Possible fix in your code
There is a workaround that is documented on the Github page for pandas-profiling under large datasets. In it, there is this example:
from pandas_profiling import ProfileReport
profile = ProfileReport(large_dataset, minimal=True)
profile.to_file("output.html")
Possible fix in the pandas-profiling source?
I got the exact same error. I tried looking it up and it seemed to be coming from a memory leak in Matplotlib, meaning the plots were not being properly released after they were created. I tried adding the following to the utils.py file within the visualization folder of pandas_profiling:
plt.clf()         # clear the current figure
plt.cla()         # clear the current axes
plt.close('all')  # close all open figures, not just the current one
The code there originally had plt.close(), which I have found in the past to not be enough when making multiple plots back to back. However, I still got the error, which makes me think it may not be Matplotlib (or I missed it somewhere).
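For reference, a minimal sketch of the back-to-back plotting pattern this is guarding against; closing each figure object right after saving keeps figures from accumulating (the loop and file names are arbitrary):
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, nothing is rendered to screen
import matplotlib.pyplot as plt

for i in range(1000):
    fig, ax = plt.subplots()
    ax.plot(range(10))
    fig.savefig('plot_{}.png'.format(i))
    plt.close(fig)  # release this figure's memory before creating the next one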
The minimal=True fix above may work sometimes. It still fails for me occasionally when my datasets are too big.
I, too, have run into that particular error. Personal experimentation with configuration file settings narrowed things down to the quantity of plots generated, or requested for generation, up to that point in the run. What varied in my experimentation was the number of ProfileReport() profile generations that required interaction graphs: the fewer interaction graphs plotted overall, the less frequently this error was encountered.
Isolating the work in separate processes via multiprocessing Pool.map() calls might help if you're only trying to generate a single profile and you absolutely need all of your interaction graphs, but that's a time- and/or RAM-hungry option that would require getting creative with instanced joins to pair column values for interaction graphs from smaller DataFrames, i.e. more reports overall.
Regardless, since the documentation for ProfileReport configuration settings is sparse, here are the ones I've stumbled into figuring out that you should probably investigate (a configuration sketch follows the list):
vars.cat.words -- Boolean, triggers word cloud graph generation for string/categorical variables, probably don't need it.
missing_diagrams.bar -- Boolean, turn off if graph is unnecessary
missing_diagrams.matrix -- Boolean, turn off if graph is unnecessary
missing_diagrams.heatmap -- Boolean, turn off if graph is unnecessary
missing_diagrams.dendrogram -- Boolean, turn off if graph is unnecessary
correlations.* (everything that isn't pearson, if you don't need it) -- Boolean, turn off if the graph is unnecessary.
interactions.targets -- Python list of strings. Specifying one or more column names here will limit interaction graphs to just those involving these variables. This is probably what you're looking for if you just can't bear to drop columns prior to report generation.
interactions.continuous -- Boolean, turn off if you just don't want interactions anyway.
plot.image_format -- string, 'svg' or 'png', and I guess there's a data structure size difference which may or may not be related to the apparent memory leak the other guy was getting at, but go figure.
plot.dpi -- integer, specifies dpi of plotted images, and likewise may be related to an apparent memory leak.
plot.pie.max_unique -- integer, set to 0 to disable occasional pie charts being graphed, likewise may be related to apparent memory leak from graph plotting.
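A sketch of how these settings can be passed, assuming a pandas-profiling version that accepts them as nested keyword arguments; key names and nesting may differ between releases, so treat this as illustrative rather than definitive:
from pandas_profiling import ProfileReport

profile = ProfileReport(
    df,
    vars={"cat": {"words": False}},  # skip word cloud generation
    missing_diagrams={"bar": False, "matrix": False, "heatmap": False, "dendrogram": False},
    correlations={  # keep only pearson
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
    interactions={"continuous": False},  # or {"targets": ["some_column"]} to limit them
    plot={"image_format": "png", "dpi": 100},
)
profile.to_file("output.html")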
Good luck, and don't be afraid to try out other options like DataPrep. I wish I did before going down this rabbit hole.

Visualization workflow of large Pandas datasets

I am handling datasets of several GB, which I process in parallel with the multiprocessing library.
It takes a lot of time, but it makes sense.
Once I have the resultant dataset, I need to plot it.
In this particular case, through matplotlib, I generate my stacked bar chart with:
plot = df.plot(kind='bar',stacked=True)
fig = plot.get_figure()
fig.savefig('plot.pdf', bbox_inches='tight')
At this point, for large datasets, this is simply unmanageable. The method runs sequentially, so it does not matter how many cores you have.
The generated plot is saved in a pdf, which in turn is also really heavy and slow to open.
Is there any alternative workflow to generate lighter plots?
So far, I've tried dropping alternate rows from the original dataset (this process may be repeated several times, until reaching a more manageable dataset). This is done with:
df = df.iloc[::2]
Let's say that it sort of works. However, I'd like to know if there are other approaches for this.
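A sketch of one way to push the row-dropping idea further: aggregate consecutive rows instead of just discarding them, and save to a raster format so the output file stays light (the group size and format are arbitrary choices here):
import numpy as np

group_size = 100  # arbitrary; tune to taste
reduced = df.groupby(np.arange(len(df)) // group_size).mean()

plot = reduced.plot(kind='bar', stacked=True)
fig = plot.get_figure()
fig.savefig('plot.png', dpi=150, bbox_inches='tight')  # a PNG is much lighter to open than a huge vector PDF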
How do you approach this kind of large-scale visualization?

Plot specifying column by name, upper case issue

I'm learning how to plot things (CSV files) in Python, using import matplotlib.pyplot as plt.
Column1;Column2;Column3;
1;4;6;
2;2;6;
3;3;8;
4;1;1;
5;4;2;
I can plot the one above with plt.plotfile('test0.csv', (0, 1), delimiter=';'), which gives me the figure below.
Do you see the axis labels, column1 and column2? They are lower case in the figure, but in the data file they begin with upper case.
Also, I tried plt.plotfile('test0.csv', ('Column1', 'Column2'), delimiter=';'), which does not work.
So it seems matplotlib.pyplot works only with lowercase names :(
Adding this issue to the other one, I guess it's time to try something else.
As I am pretty new to plotting in Python, I would like to ask: Where should I go from here, to get a little more than what matplotlib.pyplot provides?
Should I go to pandas?
You are mixing up two things here.
Matplotlib is designed for plotting data. It is not designed for managing data.
Pandas is designed for data analysis. Even if you were using pandas, you would still need to plot the data. How? Well, probably using matplotlib!
Independently of what you're doing, think of it as a three step process:
Data acquisition, data read-in
Data processing
Data representation / plotting
plt.plotfile() is a convenience function, which you can use if you don't need step 2. at all. But it surely has its limitations.
Methods to read in data (the list is of course not complete) include pure Python open(), the csv module's reader or similar, numpy / scipy, pandas, etc.
Depending on what you want to do with your data, you can already choose a suitable input method: numpy for large numerical datasets, pandas for datasets that include qualitative data or heavily rely on cross correlations, etc. A small sketch of reading and plotting your file this way follows.
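For instance, a minimal sketch of steps 1 and 3 with pandas and matplotlib for the semicolon-separated file above; using the original column names directly also sidesteps the lowercase-label issue:
import pandas as pd
import matplotlib.pyplot as plt

# Step 1: read-in; the trailing ';' produces an empty extra column, so drop all-NaN columns
df = pd.read_csv('test0.csv', delimiter=';').dropna(axis=1, how='all')

# Step 3: plotting, with the case-sensitive column names as axis labels
fig, ax = plt.subplots()
ax.plot(df['Column1'], df['Column2'])
ax.set_xlabel('Column1')
ax.set_ylabel('Column2')
fig.savefig('test0.png')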

Need help to solve performance problems of matplotlib on a Raspberry Pi

First off, sorry for this lengthy text. I'm new to python and matplotlib, so please bear with me.
As a follow-up to this question, I found the way of generating the grid to be quite time consuming on a Raspberry Pi using web2py. I have a csv file with roughly 12k lines looking like this:
1;1.0679759979248047;0.0;147.0;0.0;;{'FHR1': 'US', 'FHR2': 'INOP', 'MHR': 'INOP'};69;good;;;;1455891539.502167
The thing is that reading those 12k lines with numpy.genfromtxt already takes around 30 seconds. Populating the chart (without the fancy grids) took another 30 seconds, just using columns 1, 3 and 7 of that csv. But after adding the solution, the time exploded to 170 seconds. So now I have to figure out what to do to reduce time consumption to somewhere under a minute.
My first thought is to eliminate the csv - I'm the one reading the data anyway, so I could skip it by either keeping the data in memory or by writing it into the plot right away. And that's what I did, with a (in my mind) simple test layout, using the pdf backend. I chose to write the data into the chart every time I receive it and save the chart once the transmission is done. I thought that should work fine; well, it doesn't. I keep getting ludicrous errors:
RuntimeError: RRuleLocator estimated to generate 9178327 ticks from 0001-01-01 15:20:31.883239+00:00 to 0001-04-17 20:52:39.779205+00:00: exceeds Locator.MAXTICKS * 2 (6000000)
And believe me, I keep increasing those maxticks with every test run to top the number the error message says. It's ridiculous because that message is just for 60 seconds of data, and I want to go somewhere near 24 hours of data. I would either like the RRuleLocator to stop estimating or to just shut up and wait for the data to end. I really don't think I can make an MCVE out of this, but I can carve out the details I'm most likely messing up.
First off, I have some classes set up, so no globals. To simplify: I have a communications class that reads the serial port at one-second intervals. This is running fine and, up till yesterday, wrote whatever came in on the serial port into a csv. Now I wanted to see if I could populate the chart while receiving the data and just save it once I'm done. So for testing I added this to my .py:
import matplotlib
matplotlib.use('PDF') # I want a PDF in the end
from matplotlib import dates
import matplotlib.pyplot as plt
import numpy as np
from numpy import genfromtxt
Then I added some members to the communication class that come from the charting part, mainly the above-mentioned solution. I initialize them in the class's __init__:
self.fig = None
self.ctg = None
self.toco = None
Then I have this method, which I call once I feel the data I'm receiving is in the correct form/state, so that the chart may be prepared for populating with data:
def prepchart(self):
    # how often to show xticklabels and repeat yticklabels:
    print('prepchart')
    xtickinterval = 5
    hfmt = dates.DateFormatter('%H:%M:%S')
    self.fig = plt.figure()
    self.ctg = self.fig.add_subplot(2, 1, 1)  # two rows, one column, first plot
    plt.ylim(50, 210)
    self.toco = self.fig.add_subplot(2, 1, 2)
    plt.ylim(0, 100)
    # Set the minor ticks to every 30 seconds
    minloc = dates.SecondLocator(bysecond=[0, 30])
    minloc.MAXTICKS = 3000000
    self.ctg.xaxis.set_minor_locator(minloc)
    # self.ctg.xaxis.set_minor_locator(dates.MinuteLocator())
    self.ctg.xaxis.set_major_formatter(hfmt)
    self.toco.xaxis.set_minor_locator(dates.MinuteLocator())
    self.toco.xaxis.set_major_formatter(hfmt)
    # actg.xaxis.set_ticks(rotation=45)
    plt.xticks(rotation=45)
Then, every so often, once I have data I want to plot, I'll do this in my data processing method:
self.ctg.plot_date(timevalue, heartrate, '-')
self.toco.plot_date(timevalue, toco, '-')
Finally, once no more data is sent (this can be after one minute or 24 hours), I'll call:
def handleCTG(self):
    self.fig.savefig('/home/pi/test.pdf')
In conclusion: Am I going at this completely wrong or is there just an error in my code? And is this really a way to reduce waiting time for the chart to be generated?
OK, so here's the deal. Obviously web2py runs a pretty tight ship, meaning that there are not many threads floating around, and it sure won't start a new thread for my little chart creation. I sort of noticed this when I watched CPU usage in my Raspi's task manager and only ever saw something around 25%. Now, the Raspberry Pi has 4 cores... go do the math. First I ran my script outside of web2py on my Raspi and, lo and behold, the entire thing, including csv reading and chart rendering, only takes 20s. From there on (inspired by How to run a task outside web2py and retrieve the output) it's a piece of cake: use the well-documented subprocess module to call a new Python interpreter with this script and done. A rough sketch of that call is below. So thanks to anyone who gave this some thought.
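For illustration only, a minimal sketch of delegating the chart generation to a separate Python process from within the web2py controller; the script path and arguments are hypothetical placeholders:
import subprocess

# Launch the plotting script in its own Python process so the web2py worker is not blocked
proc = subprocess.Popen(
    ['python3', '/home/pi/make_chart.py', '/home/pi/data.csv', '/home/pi/test.pdf']
)
# proc.poll() can be checked later, or proc.wait() used if the PDF is needed immediately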

How to set x lim to 99.5 percentile of my data series for matplotlib histogram?

I'm currently pumping out some histograms with matplotlib. The issue is that, because of one or two outliers, my whole graph is incredibly small and almost impossible to read, due to having two separate histograms being plotted. The solution I am having problems with is dropping the outliers at around the 99th/99.5th percentile. I have tried using:
plt.xlim([np.percentile(df,0), np.percentile(df,99.5)])
plt.xlim([df.min(),np.percentile(df,99.5)])
Seems like it should be a simple fix, but I'm missing some key information to make it happen. Any input would be much appreciated, thanks in advance.
To restrict focus to just the middle 99% of the values, you could do something like this:
trimmed_data = df[(df.Column > df.Column.quantile(0.005)) & (df.Column < df.Column.quantile(0.995))]
Then you could do your histogram on trimmed_data. Exactly how to exclude outliers is more of a stats question than a Python question, but basically the idea I was suggesting in a comment is to clean up the data set using whatever methods you can defend, and then do everything (plots, stats, etc.) on only the cleaned dataset, rather than trying to tweak each individual plot to make it look right while still having the outlier data in there.
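A short usage sketch of that approach, assuming the values of interest live in a hypothetical column named Column:
import matplotlib.pyplot as plt

lower = df.Column.quantile(0.005)
upper = df.Column.quantile(0.995)
trimmed_data = df[(df.Column > lower) & (df.Column < upper)]

plt.hist(trimmed_data.Column, bins=50)  # bin count is an arbitrary choice
plt.savefig('histogram.png')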
