How to use pandas .describe() - python

I have used pandas in the past but I have recently run into a problem where my code is not displaying the .head() or the .describe() function. I have copied my code below from another website and it is still not displaying. Any help is appreciated.
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt
pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format
training_df = pd.read_csv(filepath_or_buffer="california_housing_train.csv")
training_df["median_house_value"] /= 1000.0
training_df.describe(include = 'all')

Your answer will work in a notebook or REPL, but doesn't actually print. Make sure to call the print() function to output while running.

Related

Python Pandas leaking memory?

I have a script that is constantly measuring some data and regularly storing it in a file. In the past I was storing the data in a "manually created CSV" file in this way (pseudocode):
with open('data.csv','w') as ofile:
print('var1,var2,var3,...,varN', file=ofile) # Create CSV header.
while measure:
do_something()
print(f'{var1},{var2},{var3},...,{varN}', file=ofile) # Store data.
I worked in this way for several months and runned this script several hundreds of times with no issues other than 1) this is cumbersome (and prone to errors) when N is large (in my case between 20 and 30) and 2) CSV does not preserve data types. So I decided to change to something like this:
temporary_df = pandas.DataFrame()
while measure:
do_something()
temporary_df.append({'var1':var1,'var2':var2,...,'varN':varN}, ignore_index=True)
if save_data_in_this_iteration():
temporary_df.to_feather(f'file_{datetime.datetime.now()}.fd')
temporary_df = pandas.DataFrame() # Clean the dataframe.
merge_all_feathers_into_single_feather()
At a first glance this was working perfectly as I expected. However, after some hours Python crashes. After experiencing this both in a Windows and in a (separate) Linux machine I, I noticed that the problem is that Python is slowly sucking the memory of the machine until there is no more memory, and then of course it crashes.
As the function do_something is unchanged between the two approaches, and the crash happens before merge_all_feathers_into_single_feather is called, and save_data_in_this_iteration is trivially simple, I am blaming Pandas for this problem.
Google has told me that other people in the past have had memory problems while using Pandas. I have tried adding the garbage collector line in each iteration, as suggested e.g. here, but did not worked for me. I didn't tried the mutiprocessing approach yet because it looks like killing an ant with a nuke, and may bring other complications...
Is there any solution to keep using Pandas like this? Is there a better solution to this without using Pandas? Which?
Pandas was not the problem
After struggling with this problem for a while, I decided to create a MWE to do some tests. So I wrote this:
import pandas
import numpy
import datetime
df = pandas.DataFrame()
while True:
df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
if 'last_clean' not in locals() or (datetime.datetime.now()-last_clean).seconds > .5:
last_clean = datetime.datetime.now()
df.to_feather('delete_me.fd')
df = df[0:0]
To my surprise, the memory is not drained by this script! So here I concluded that Pandas was not my problem.
Then I added a new component to the MWE and I found the issue:
import pandas
import numpy
import datetime
import matplotlib.pyplot as plt
def save_matplotlib_plot(df):
fig, ax = plt.subplots()
ax.plot(df['col_1'], df['col_2'])
fig.savefig('delete_me.png')
# Uncomment the following two lines to release the memory and stop the "leak".
# ~ fig.clear()
# ~ plt.close(fig)
df = pandas.DataFrame()
while True:
df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
if 'last_clean' not in locals() or (datetime.datetime.now()-last_clean).seconds > .5:
last_clean = datetime.datetime.now()
df.to_feather('delete_me.fd')
save_matplotlib_plot(df) # Here I had my "leak" (which was not a leak indeed because matplotlib keeps track of all the figures it creates, so it was working as expected).
df = df[0:0]
It seems that when I switched from "handmade CSV" to "Pandas" I also changed something with the plots, so I was blaming Pandas when it was not the problem.
Just for completeness, the multiprocessing solution also works. The following script has no memory issues:
import pandas
import numpy
import datetime
import matplotlib.pyplot as plt
from multiprocessing import Process
def save_matplotlib_plot(df):
fig, ax = plt.subplots()
ax.plot(df['col_1'], df['col_2'])
fig.savefig('delete_me.png')
df = pandas.DataFrame()
while True:
df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
if 'last_clean' not in locals() or (datetime.datetime.now()-last_clean).seconds > .5:
last_clean = datetime.datetime.now()
df.to_feather('delete_me.fd')
p = Process(target=save_matplotlib_plot, args=(df,))
p.start()
p.join()
df = df[0:0]

Codes in Ipython vs Pycharm

I am a newbie and the following question may be dumb and not well written.
I tried the following block of codes in Ipython:
%pylab qt5
x = randn(100,100)
y = mean(x,0)
import seaborn
plot(y)
And it delivered a plot. Everything was fine.
However, when I copied and pasted those same lines of codes to Pycharm and tried running, syntax error messages appeared.
For instance,
%pylab was not recognized.
Then I tried to import numpy and matplotlib one by one. But then,
randn(.,.) was not recognized.
You can use IPython/Jupyter notebooks in PyCharm by following this guide:
https://www.jetbrains.com/help/pycharm/using-ipython-jupyter-notebook-with-pycharm.html
You may modify code like the snippet below in order to run in PyCharm:
from numpy.random import randn
from numpy import mean
import seaborn
x = randn(10, 10)
y = mean(x, 0)
seaborn.plt.plot(x)
seaborn.plt.show()

statsmodels package code works in spyder's ipython console but not in python script

This is a python script in the Spyder IDE of Anaconda. I have Python 3.6.2. The last two lines do nothing (apparently) when I run the script, but work if I type them in the IPython console of Spyder. How do I get them to work in the script please?
# import packages
import os # misc operating system functions
import sys # system parameters and functions
import pandas as pd # data frame handling
import statsmodels.formula.api as sm # stats module
import matplotlib.pyplot as plt # matlab-like plotting
import numpy as np # big, fast arrays for maths
# set working directory to here
os.chdir(os.path.dirname(sys.argv[0]))
# read data using pandas
datafile = '1314 Powerview Pasture Potential.csv'
data1 = pd.read_csv(datafile)
#list(data) # to show column names
# plot some data
# don't know how to pop out a separate window
plt.scatter(data1['long'],data1['lat'],s=40,facecolors='none',edgecolors='b')
# simple multiple regression
Y = data1[['Pasture and Crop eaten t DM/ha']]
X = data1[['Net imported Supplements per Ha',
'LWT/ha',
'lat']]
result = sm.OLS(Y,X).fit()
result.summary()

How to use tqdm with pandas in a jupyter notebook?

I'm doing some analysis with pandas in a jupyter notebook and since my apply function takes a long time I would like to see a progress bar.
Through this post here I found the tqdm library that provides a simple progress bar for pandas operations.
There is also a Jupyter integration that provides a really nice progress bar where the bar itself changes over time.
However, I would like to combine the two and don't quite get how to do that.
Let's just take the same example as in the documentation
import pandas as pd
import numpy as np
from tqdm import tqdm
df = pd.DataFrame(np.random.randint(0, 100, (100000, 6)))
# Register `pandas.progress_apply` and `pandas.Series.map_apply` with `tqdm`
# (can use `tqdm_gui`, `tqdm_notebook`, optional kwargs, etc.)
tqdm.pandas(desc="my bar!")
# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`
df.progress_apply(lambda x: x**2)
# can also groupby:
# df.groupby(0).progress_apply(lambda x: x**2)
It even says "can use 'tqdm_notebook' " but I don't find a way how.
I've tried a few things like
tqdm_notebook(tqdm.pandas(desc="my bar!"))
or
tqdm_notebook.pandas
but they don't work.
In the definition it looks to me like
tqdm.pandas(tqdm_notebook(desc="my bar!"))
should work, but the bar doesn't properly show the progress and there is still additional output.
Any other ideas?
My working solution (copied from the documentation):
from tqdm.auto import tqdm
tqdm.pandas()
You can use:
tqdm_notebook().pandas(*args, **kwargs)
This is because tqdm_notebook has a delayer adapter, so it's necessary to instanciate it before accessing its methods (including class methods).
In the future (>v5.1), you should be able to use a more uniform API:
tqdm_pandas(tqdm_notebook, *args, **kwargs)
I found that I had to import tqdm_notebook also. A simple example is given below that works in Jupyter notebook.
Given you want to map a function on a variable to create a new variable in your pandas dataframe.
# progress bar
from tqdm import tqdm, tqdm_notebook
# instantiate
tqdm.pandas(tqdm_notebook)
# replace map with progress_map
# where df is a pandas dataframe
df['new_variable'] = df['old_variable'].progress_map(some_function)
If you want to use more than 1 CPU for that slow apply step, consider using swifter. As a bonus, swifter automatically enables a tqdm progress bar on the apply step. To customize the bar description, use :
df.swifter.progress_bar(enable=True, desc='bar description').apply(...)
from tqdm.notebook import tqdm
tqdm.pandas()
for versions 4.64.0 and greater.

Pandas plot doesn't show

When using this in a script (not IPython), nothing happens, i.e. the plot window doesn't appear :
import numpy as np
import pandas as pd
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts.plot()
Even when adding time.sleep(5), there is still nothing. Why?
Is there a way to do it, without having to manually call matplotlib ?
Once you have made your plot, you need to tell matplotlib to show it. The usual way to do things is to import matplotlib.pyplot and call show from there:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts.plot()
plt.show()
In older versions of pandas, you were able to find a backdoor to matplotlib, as in the example below. NOTE: This no longer works in modern versions of pandas, and I still recommend importing matplotlib separately, as in the example above.
import numpy as np
import pandas as pd
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts.plot()
pd.tseries.plotting.pylab.show()
But all you are doing there is finding somewhere that matplotlib has been imported in pandas, and calling the same show function from there.
Are you trying to avoid calling matplotlib in an effort to speed things up? If so then you are really not speeding anything up, since pandas already imports pyplot:
python -mtimeit -s 'import pandas as pd'
100000000 loops, best of 3: 0.0122 usec per loop
python -mtimeit -s 'import pandas as pd; import matplotlib.pyplot as plt'
100000000 loops, best of 3: 0.0125 usec per loop
Finally, the reason the example you linked in comments doesn't need the call to matplotlib is because it is being run interactively in an iPython notebook, not in a script.
In case you are using matplotlib, and still, things don't show up in iPython notebook (or Jupyter Lab as well) remember to set the inline option for matplotlib in the notebook.
import matplotlib.pyplot as plt
%matplotlib inline
Then the following code will work flawlessly:
fig, ax = plt.subplots(figsize=(16,9));
change_per_ins.plot(ax=ax, kind='hist')
If you don't set the inline option it won't show up and by adding a plt.show() in the end you will get duplicate outputs.
I did just
import matplotlib.pyplot as plt
%matplotlib inline
and add line
plt.show()
next to df.plot() and it worked well for
The other answers involve importing matplotlib.pyplot and/or calling some second function manually.
Instead, you can configure matplotlib to be in interactive mode with its configuration files.
Simply add the line
interactive: True
to a file called matplotlibrc in one of the following places:
In the current working directory
In the platform specific user directory specified by matplotlib.get_configdir()
On unix-like system, typically /home/username/.config/matplotlib/
On Windows C:\\Documents and Settings\\username\\.matplotlib\\

Categories

Resources