Matplotlib crashes - python

I'm trying to learn data visualization with matplotlib, but it keeps crashing whenever I call the plot function:
>>> import pylab
>>> import numpy as np
>>> import pandas as pd
>>> names1880 = pd.read_csv('/root/yob1880.txt', names=['name', 'sex', 'births'])
>>> pieces = []
>>> for year, group in names.groupby(['year', 'sex']):
...     pieces.append(group.sort_index(by='births', ascending=False)[:1000])
...
>>> top1000 = pd.concat(pieces, ignore_index=True)
>>> table = top1000.pivot_table(rows='year', cols='sex', aggfunc=sum)
>>> table.plot(title='Sum of table1000 by year and sex',
...            yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))
Qt: Session management error: None of the authentication protocols specified are supported
<matplotlib.axes.AxesSubplot object at 0x3063610>
I'm running Kali Linux 1.0.7. The data frame looks fine and my code runs fine until I attempt to plot it, so why am I getting that error every time? Even when I call pylab.show() it doesn't plot the line.

Looks like the problem is with Qt, not Matplotlib. What happens when you run qtcreator or something similar? If that fails too, then you probably need to rebuild Qt.
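One way to confirm that Qt, not your data, is the culprit is to force a non-GUI backend before importing pyplot and write the figure to a file. If this works, the plotting code is fine and only the Qt backend/session is broken. A minimal sketch (the data here is a stand-in, not the names dataset):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; Qt is never touched
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data roughly shaped like the pivot-table plot above
x = np.arange(1880, 2020, 10)
fig, ax = plt.subplots()
ax.plot(x, np.linspace(0, 1.2, len(x)))
ax.set_title('Backend smoke test')
fig.savefig('backend_test.png')  # if this file appears, matplotlib itself is fine
plt.close(fig)
```

If the PNG is written without errors, the crash is purely a Qt/session issue and you can keep working with a file-based backend while you fix Qt.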

Related

How to use pandas .describe()

I have used pandas in the past, but I have recently run into a problem where my code is not displaying the output of .head() or .describe(). I copied my code below from another website and it is still not displaying anything. Any help is appreciated.
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt
pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format
training_df = pd.read_csv(filepath_or_buffer="california_housing_train.csv")
training_df["median_house_value"] /= 1000.0
training_df.describe(include = 'all')
Your code will work in a notebook or REPL, where the last expression is echoed automatically, but a plain script doesn't print it. Wrap the call in print() to see the output when running as a script.
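As a self-contained illustration (using a made-up DataFrame instead of the housing CSV):

```python
import pandas as pd

df = pd.DataFrame({"median_house_value": [120.0, 250.5, 87.3]})

df.describe()         # in a script, this line produces no output at all
print(df.describe())  # this actually prints the summary table
```

The same applies to df.head() or any other expression whose value you want to see from a script.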

Python Pandas leaking memory?

I have a script that is constantly measuring some data and regularly storing it in a file. In the past I was storing the data in a "manually created CSV" file in this way (pseudocode):
with open('data.csv','w') as ofile:
    print('var1,var2,var3,...,varN', file=ofile) # Create CSV header.
    while measure:
        do_something()
        print(f'{var1},{var2},{var3},...,{varN}', file=ofile) # Store data.
I worked this way for several months and ran this script several hundred times with no issues other than: 1) it is cumbersome (and error-prone) when N is large (in my case between 20 and 30), and 2) CSV does not preserve data types. So I decided to change to something like this:
temporary_df = pandas.DataFrame()
while measure:
    do_something()
    # Note: DataFrame.append returns a new frame, so the result must be assigned back.
    temporary_df = temporary_df.append({'var1':var1,'var2':var2,...,'varN':varN}, ignore_index=True)
    if save_data_in_this_iteration():
        temporary_df.to_feather(f'file_{datetime.datetime.now()}.fd')
        temporary_df = pandas.DataFrame() # Clean the dataframe.
merge_all_feathers_into_single_feather()
At first glance this worked exactly as I expected. However, after some hours Python crashes. After experiencing this on both a Windows and a (separate) Linux machine, I noticed that Python is slowly consuming the machine's memory until there is none left, and then of course it crashes.
As the function do_something is unchanged between the two approaches, and the crash happens before merge_all_feathers_into_single_feather is called, and save_data_in_this_iteration is trivially simple, I am blaming Pandas for this problem.
Google has told me that other people have had memory problems while using Pandas. I have tried adding the garbage collector line in each iteration, as suggested e.g. here, but it did not work for me. I haven't tried the multiprocessing approach yet because it looks like killing an ant with a nuke, and may bring other complications...
Is there any solution to keep using Pandas like this? Is there a better solution to this without using Pandas? Which?
Pandas was not the problem
After struggling with this problem for a while, I decided to create a MWE to do some tests. So I wrote this:
import pandas
import numpy
import datetime
df = pandas.DataFrame()
while True:
    df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
    if 'last_clean' not in locals() or (datetime.datetime.now()-last_clean).seconds > .5:
        last_clean = datetime.datetime.now()
        df.to_feather('delete_me.fd')
        df = df[0:0]
To my surprise, the memory is not drained by this script! So here I concluded that Pandas was not my problem.
Then I added a new component to the MWE and I found the issue:
import pandas
import numpy
import datetime
import matplotlib.pyplot as plt
def save_matplotlib_plot(df):
    fig, ax = plt.subplots()
    ax.plot(df['col_1'], df['col_2'])
    fig.savefig('delete_me.png')
    # Uncomment the following two lines to release the memory and stop the "leak".
    # ~ fig.clear()
    # ~ plt.close(fig)

df = pandas.DataFrame()
while True:
    df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
    if 'last_clean' not in locals() or (datetime.datetime.now()-last_clean).seconds > .5:
        last_clean = datetime.datetime.now()
        df.to_feather('delete_me.fd')
        save_matplotlib_plot(df) # Here was my "leak" (not really a leak: matplotlib keeps track of all the figures it creates, so it was working as expected).
        df = df[0:0]
It seems that when I switched from "handmade CSV" to "Pandas" I also changed something with the plots, so I was blaming Pandas when it was not the problem.
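The pyplot figure registry can be observed directly: every figure stays alive until it is explicitly closed. A small sketch of that behaviour (the Agg backend is used here only so it runs headless):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, no GUI needed
import matplotlib.pyplot as plt

# Without plt.close(), every figure stays registered in pyplot's state.
for _ in range(5):
    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1])
print(len(plt.get_fignums()))  # 5 figures still alive

plt.close('all')  # release them all at once
print(len(plt.get_fignums()))  # 0
```

In a long-running loop, each unclosed figure holds on to its data, which is exactly the growth pattern that looked like a Pandas leak.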
Just for completeness, the multiprocessing solution also works. The following script has no memory issues:
import pandas
import numpy
import datetime
import matplotlib.pyplot as plt
from multiprocessing import Process
def save_matplotlib_plot(df):
    fig, ax = plt.subplots()
    ax.plot(df['col_1'], df['col_2'])
    fig.savefig('delete_me.png')

df = pandas.DataFrame()
while True:
    df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
    if 'last_clean' not in locals() or (datetime.datetime.now()-last_clean).seconds > .5:
        last_clean = datetime.datetime.now()
        df.to_feather('delete_me.fd')
        p = Process(target=save_matplotlib_plot, args=(df,))
        p.start()
        p.join()
        df = df[0:0]
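As a side note unrelated to the memory issue: DataFrame.append, used in the snippets here, was deprecated in pandas 1.4 and removed in 2.0. Collecting rows in a plain list and building the frame once is the current idiom, and it also avoids the quadratic copying of append-in-a-loop. A sketch:

```python
import pandas as pd
import numpy as np

# Accumulate plain dicts in a list; this is cheap per iteration.
rows = []
for _ in range(10):
    rows.append({f'col_{i}': np.random.rand() for i in range(3)})

df = pd.DataFrame(rows)  # build the frame once instead of repeated .append()
print(df.shape)          # (10, 3)
```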

Code in IPython vs PyCharm

I am a newbie and the following question may be dumb and not well written.
I tried the following block of code in IPython:
%pylab qt5
x = randn(100,100)
y = mean(x,0)
import seaborn
plot(y)
And it delivered a plot. Everything was fine.
However, when I copied and pasted the same lines into PyCharm and ran them, I got syntax error messages.
For instance,
%pylab was not recognized.
Then I tried importing numpy and matplotlib one by one. But then,
randn(.,.) was not recognized.
You can use IPython/Jupyter notebooks in PyCharm by following this guide:
https://www.jetbrains.com/help/pycharm/using-ipython-jupyter-notebook-with-pycharm.html
In a plain script you need the imports that %pylab performed for you (%pylab is an IPython magic, not Python syntax; it pulls numpy and matplotlib.pyplot into the namespace, which is why randn and plot work without imports in IPython but not in PyCharm). You can modify the code like this to run in PyCharm:
from numpy.random import randn
from numpy import mean
import matplotlib.pyplot as plt
import seaborn  # importing seaborn applies its styling to matplotlib plots

x = randn(100, 100)
y = mean(x, 0)
plt.plot(y)
plt.show()

Getting ggplot for Python to make a bar chart

Following this simple example, I am trying to make a dirt simple bar chart using yhat's ggplot python module. Here is the code suggested previously on StackOverflow:
In [1]:
from ggplot import *
import pandas as pd
df = pd.DataFrame({"x":[1,2,3,4], "y":[1,3,4,2]})
ggplot(aes(x="x", weight="y"), df) + geom_bar()
But I get an error:
Out[1]:
<repr(<ggplot.ggplot.ggplot at 0x104b18ad0>) failed: UnboundLocalError: local variable 'ax' referenced before assignment>
This works with a newer version of ggplot-python. It's not that pretty (the x-axis labels); we really have to work on that :-(
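If the ggplot module keeps failing, the same chart can be drawn with pandas/matplotlib directly. This is a workaround sketch, not the ggplot API:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 3, 4, 2]})
ax = df.plot.bar(x="x", y="y", legend=False)  # one bar per row, height taken from y
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()
```

You lose the grammar-of-graphics syntax, but the result is the same dirt-simple bar chart.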

Rpy2 & ggplot2: LookupError 'print.ggplot'

Unhindered by any pre-existing knowledge of R, rpy2, or ggplot2, I would nevertheless like to create a scatterplot of a trivial table from Python.
To set this up I've just installed:
Ubuntu 11.10 64 bit
R version 2.14.2 (from r-cran mirror)
ggplot2 (through R> install.packages('ggplot2'))
rpy2-2.2.5 (through easy_install)
Following this I am able to plot some example dataframes from an interactive R session using ggplot2.
However, when I merely try to import ggplot2 as I've seen in an example I found online, I get the following error:
from rpy2.robjects.lib import ggplot2
  File ".../rpy2/robjects/lib/ggplot2.py", line 23, in <module>
    class GGPlot(robjects.RObject):
  File ".../rpy2/robjects/lib/ggplot2.py", line 26, in GGPlot
    _rprint = ggplot2_env['print.ggplot']
  File ".../rpy2/robjects/environments.py", line 14, in __getitem__
    res = super(Environment, self).__getitem__(item)
LookupError: 'print.ggplot' not found
Can anyone tell me what I am doing wrong? As I said, the offending import comes from an online example, so it might well be that there is some other way I should be using ggplot2 through rpy2.
For reference, and unrelated to the problem above, here is an example of the dataframe I would like to plot once I get the import to work (that part should not be a problem, looking at the examples). The idea is to create a scatter plot with the lengths on the x axis, the percentages on the y axis, and the boolean used to color the dots, which I would then like to save to a file (either image or PDF). Given that these requirements are very limited, alternative solutions are welcome as well.
original.length row.retained percentage.retained
1 1875 FALSE 11.00
2 1143 FALSE 23.00
3 960 FALSE 44.00
4 1302 FALSE 66.00
5 2016 TRUE 87.00
There were changes in the R package ggplot2 that broke the rpy2 layer.
Try with a recent (I just fixed this) snapshot of the "default" branch (rpy2-2.3.0-dev) for the rpy2 code on bitbucket.
Edit: rpy2-2.3.0 is a couple of months behind schedule. I just pushed a bugfix release rpy2-2.2.6 that should address the problem.
Although I can't help you with a fix for the import error you're seeing, there is a similar example using lattice here: lattice with rpy2.
Also, the standard R plot function accepts coloring via the factor function (to which you can feed the row.retained column). Example:
plot(original.length, percentage.retained, type="p", col=factor(row.retained))
Based on fucitol's answer I've instead implemented the plot using both the default plot & lattice. Here are both implementations:
from rpy2 import robjects

# Convert to R objects
original_lengths = robjects.IntVector(original_lengths)
percentages_retained = robjects.FloatVector(percentages_retained)
row_retained = robjects.StrVector(row_retained)

# Plot using standard plot
r = robjects.r
r.plot(x=percentages_retained,
       y=original_lengths,
       col=row_retained,
       main='Title',
       xlab='Percentage retained',
       ylab='Original length',
       sub='subtitle',
       pch=18)

# Plot using lattice
from rpy2.robjects import Formula
from rpy2.robjects.packages import importr

lattice = importr('lattice')
formula = Formula('lengths ~ percentages')
formula.getenvironment()['lengths'] = original_lengths
formula.getenvironment()['percentages'] = percentages_retained
p = lattice.xyplot(formula,
                   col=row_retained,
                   main='Title',
                   xlab='Percentage retained',
                   ylab='Original length',
                   sub='subtitle',
                   pch=18)
rprint = robjects.globalenv.get("print")
rprint(p)
It's a shame I can't get ggplot2 to work, as it produces nicer graphs by default and I regard working with dataframes as more explicit. Any help in that direction is still welcome!
If you don't have any experience with R but do with Python, you can use numpy or pandas for data analysis and matplotlib for plotting.
Here is a small example of how this feels:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'original_length': [1875, 1143, 960, 1302, 2016],
                   'row_retained': [False, False, False, False, True],
                   'percentage_retained': [11.0, 23.0, 44.0, 66.0, 87.0]})

fig, ax = plt.subplots()
ax.scatter(df.original_length, df.percentage_retained,
           c=np.where(df.row_retained, 'green', 'red'),
           s=np.random.randint(50, 500, 5))

true_value = df[df.row_retained]
ax.annotate('This one is True',
            xy=(true_value.original_length, true_value.percentage_retained),
            xytext=(0.1, 0.001), textcoords='figure fraction',
            arrowprops=dict(arrowstyle="->"))

ax.grid()
ax.set_xlabel('Original Length')
ax.set_ylabel('Percentage Retained')
ax.margins(0.04)
plt.tight_layout()
plt.savefig('alternative.png')
pandas also has an experimental rpy2 interface.
The problem is caused by the latest ggplot2 version, 0.9.0. This version doesn't have the function print.ggplot(), which is found in ggplot2 version 0.8.9.
I tried to tinker with the rpy2 code to make it work with the newest ggplot2, but the extent of the changes seems to be quite large.
Meanwhile, just downgrade your ggplot2 version to 0.8.9.
