Display pandas dataframe using custom style inside function in IPython - python

In a jupyter notebook, I have a function which prepares the input features and targets matrices for a tensorflow model.
Inside this function, I would like to display a correlation matrix with a background gradient to better see the strongly correlated features.
This answer shows how to do exactly what I want. The problem is that from inside a function I cannot get any output, i.e. this:
import numpy as np
import pandas as pd

def display_corr_matrix_custom():
    rs = np.random.RandomState(0)
    df = pd.DataFrame(rs.rand(10, 10))
    corr = df.corr()
    corr.style.background_gradient(cmap='coolwarm')  # the Styler is created but never displayed or returned

display_corr_matrix_custom()
clearly does not show anything. Normally, I use IPython's display.display() function. In this case, however, I cannot use it since I want to retain my custom background.
Is there another way to display this matrix (if possible, without matplotlib) without returning it?
EDIT: Inside my real function, I also display other things (such as a data description), and I would like to display the correlation matrix at a precise location. Furthermore, my function returns many dataframes, so returning the matrix as proposed by @brentertainer does not directly display it.

You mostly have it. Two changes:
Get the Styler object from corr.
Display the Styler inside the function using IPython's display.display().
import numpy as np
import pandas as pd
from IPython.display import display

def display_corr_matrix_custom():
    rs = np.random.RandomState(0)
    df = pd.DataFrame(rs.rand(10, 10))
    corr = df.corr()  # corr is a DataFrame
    styler = corr.style.background_gradient(cmap='coolwarm')  # styler is a Styler
    display(styler)  # using IPython's display() function

display_corr_matrix_custom()
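Regarding the edit about positioning: display() can be called multiple times inside the function, so the styled matrix can be emitted exactly where it should appear among the other output. A minimal sketch, using a hypothetical prepare_features() helper in place of the real function:
import numpy as np
import pandas as pd
from IPython.display import display

def prepare_features(df):
    display(df.describe())  # data description first
    styler = df.corr().style.background_gradient(cmap='coolwarm')
    display(styler)         # correlation matrix at this exact spot
    return df               # placeholder for the real feature/target matrices

prepare_features(pd.DataFrame(np.random.RandomState(0).rand(10, 10)))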

Related

How to make statsmodels GLM.fit_constrained result picklable/store-and-reloadable

A GLS (and thus also an OLS) regression with constraints on the parameters can readily be run using statsmodels' GLM.fit_constrained() method, as with the code below (or here).
How can I make the GLMResults object resulting from such a statsmodels GLM.fit_constrained() regression picklable, so that the estimation result can be stored for re-use for prediction in a new session anytime later?
The GLMResults object obtained from fit_constrained(), which contains the relevant estimation result, has a .save() method that would normally readily pickle the object into a file.
This .save() works for the result of a standard (unconstrained) GLM regression, GLM(...).fit(). However, it does not work for the result of fit_constrained(). Instead, it throws a pickling error, seemingly because patsy's DesignMatrixBuilder is not picklable, which links back to the never-resolved issue here. This is the case at least for my Python 3.6.3 (running on Windows).
An example:
import numpy as np
import pandas as pd
import statsmodels
import statsmodels.api as sm

# Define example data & constraints:
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 5)), columns=list('ABCDF'))
y = df['A']
X = df[['B','C','D','F']]
constraints = ['B + C + D', 'C - F'] # Add two linear constraints on parameters: B+C+D = 0 & C-F = 0
statsmodels.genmod.families.links.identity()  # has no effect here; GLM defaults to the Gaussian family with an identity link, making this equivalent to OLS
OLS_from_GLM = sm.GLM(y, X)
# Unconstrained regression:
result_u = OLS_from_GLM.fit()
result_u.save('myfile_u.pickle') # This works
# Constrained regression - save() fails
result_c = OLS_from_GLM.fit_constrained(constraints)
result_c.save('myfile_c.pickle')
# This fails with a pickling error (tested in Python 3.6.3 on Windows):
# "NotImplementedError: Sorry, pickling not yet supported. See https://github.com/pydata/patsy/issues/26 if you want to help."
Is there a way to readily make the result from fit_constrained() picklable, i.e. storable?
Below I suggest a first workaround answer; it is trivial and has worked well for me so far. I do not know, however, whether it is truly advisable, what risks it carries, or whether a preferable alternative exists.
I got this to work by simply removing (commenting out) the line
res._results.constraints = lc
in the function definition of fit_constrained() within statsmodels' active generalized_linear_model.py script (in my case in the virtualenv folder \env\Lib\site-packages\statsmodels\genmod\generalized_linear_model.py).
Disabling this line seems to have created no problems for my work: I can now readily save and reload the pickled file and use it to make correct predictions based on the stored estimation; the imposed parameter constraints remain respected, and predictions made with .predict() are unchanged after reloading.
I wonder, though, whether there is any major risk attached to this procedure. I am not familiar with the inner workings of the statsmodels library, or with its GLM fit_constrained() method in particular, and I reckon it is inadvisable to change anything in a pre-existing module one does not understand. However, it is the only way I have found to conveniently impose various constraints on my GLM parameters while still being able to save the regression results for re-use in a later session.
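A related workaround that avoids editing the library source, sketched here under the untested assumption that the constraints attribute only stores the patsy-based metadata which breaks pickling and is not needed for prediction, would be to strip that attribute from the fitted result before saving:
import statsmodels.api as sm

# Hedged sketch: drop the (assumed) unpicklable constraints metadata instead of patching statsmodels
result_c = OLS_from_GLM.fit_constrained(constraints)
result_c._results.constraints = None   # assumption: this only holds the patsy LinearConstraints object
result_c.save('myfile_c.pickle')

# In a later session, reload and predict (sm.load is statsmodels' pickle loader)
result_reloaded = sm.load('myfile_c.pickle')
predictions = result_reloaded.predict(X)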

Possible and/or wise to assign a method to a variable?

I'm repeatedly trying to get similar data from time series dataframes. For example, the monthly (annualized) standard deviation of the series:
any_df.resample('M').std().mean()*12**(1/2)
It would save typing and would probably limit errors if these methods could be assigned to a variable, so they can be re-used - I guess this would look something like
my_stdev = .resample('M').std().mean()*12**(1/2)
result = any_df.my_stdev()
Is this possible and if so is it sensible?
Thanks in advance!
Why not just make your own function?
def my_stdev(df):
    return df.resample('M').std().mean() * 12**(1/2)

result = my_stdev(any_df)
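If the chained, method-like style is what you're after, pandas also has DataFrame.pipe(), which passes the frame to an ordinary function, so the helper above can be used inline without attaching anything to the DataFrame (any_df is assumed to be a DataFrame with a DatetimeIndex, as in the question):
result = any_df.pipe(my_stdev)  # equivalent to my_stdev(any_df), but chains like a method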

How to inspect a numpy/pandas object, i.e. str() in R

When I use R, I can use str() to inspect objects, which most of the time are a list of things.
I recently switched to Python for statistics and don't know how to inspect the objects I encounter. For example:
import statsmodels.api as sm
heart = sm.datasets.heart.load_pandas().data
heart.groupby(['censors'])['age']
I want to investigate what kind of object is heart.groupby(['censors']) that allows me to add ['age'] at the end. However, print heart.groupby(['censors']) only tells me the type of the object, not its structure and what I can do with it.
So how do I get to understand the structure of numpy / pandas object, similar to str() in R?
If you're trying to get some insight into what you can do with a Python object, you can inspect it using a beefed-up Python console like IPython. In an IPython session, first put the object you want to look at into a variable:
import statsmodels.api as sm
heart = sm.datasets.heart.load_pandas().data
h_grouped = heart.groupby(['censors'])
Then type out the variable name and double-tap Tab to bring up a list of the object's methods:
In [5]: h_grouped.<Tab><Tab>
# Shows the object's methods
A further benefit of the IPython console is that you can quickly check the help for any individual method by adding a ?:
h_grouped.apply?
# Apply function and combine results
# together in an intelligent way.
If you don't have IPython or a similar console, you can achieve something similar using dir(), e.g. dir(h_grouped), although this will also list the object's private methods, which are generally not useful and shouldn't be touched in regular use.
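A small sketch of that dir()-based approach, filtering out the private names (it assumes the h_grouped object defined above):
public_names = [name for name in dir(h_grouped) if not name.startswith('_')]
print(public_names)  # names like 'agg', 'apply', 'mean', ... (the exact list depends on the pandas version)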
type(heart.groupby(['censors'])['age'])
type will tell you what kind of object it is. At the moment you are grouping by a dimension and not telling pandas what to do with age. If you want the mean for example you could do:
heart.groupby(['censors'])['age'].mean()
This would take the mean of age by the group, and return a series.
The groupby is I think a red herring -- "age" is just a column name:
import statsmodels.api as sm
heart = sm.datasets.heart.load_pandas().data
heart
# survival censors age
# 0 15 1 54.3
# ...
heart.keys()
# Index([u'survival', u'censors', u'age'], dtype='object')

Python: KeyError 'shift'

I am new to Python and try to modify a pair trading script that I found here:
https://github.com/quantopian/zipline/blob/master/zipline/examples/pairtrade.py
The original script is designed to use only prices. I would like to use returns to fit my models and prices for the invested quantity, but I don't see how to do it.
I have tried:
to define a data frame of returns in the main script and call it in run
to define a data frame of returns in the main script as a global object and use it where needed in handle_data
to define a data frame of returns directly in handle_data
I assume the last option is the most appropriate, but then I get an error with pandas' 'shift' attribute.
More specifically I try to define 'DataRegression' as follow:
DataRegression = data.copy()
DataRegression[Stock1]=DataRegression[Stock1]/DataRegression[Stock1].shift(1)-1
DataRegression[Stock2]=DataRegression[Stock2]/DataRegression[Stock2].shift(1)-1
DataRegression[Stock3]=DataRegression[Stock3]/DataRegression[Stock3].shift(1)-1
DataRegression = DataRegression.dropna(axis=0)
where 'data' is a data frame containing the prices, and Stock1, Stock2 and Stock3 are column names defined globally. Those lines in handle_data return the error:
File "A:\Apps\Python\Python.2.7.3.x86\lib\site-packages\zipline-0.5.6-py2.7.egg\zipline\utils\protocol_utils.py", line 85, in __getattr__
return self.__internal[key]
KeyError: 'shift'
Would anyone know why and how to do that correctly?
Many Thanks,
Vincent
This is an interesting idea. The easiest way to do this in zipline is to use the Returns transform which adds a returns field to the event-frame (which is an ndict, not a pandas DataFrame as someone pointed out).
For this you have to add the transform to the initialize method:
self.add_transform(Returns, 'returns', window_length=1)
(make sure to add from zipline.transforms import Returns at the beginning).
Then, inside the batch_transform you can access returns instead of prices:
@batch_transform
def ols_transform(data, sid1, sid2):
    """Computes regression coefficient (slope and intercept)
    via Ordinary Least Squares between two SIDs.
    """
    p0 = data.returns[sid1]
    p1 = sm.add_constant(data.returns[sid2])
    slope, intercept = sm.OLS(p0, p1).fit().params
    return slope, intercept
Alternatively, you could also create a batch_transform to convert prices to returns like you wanted to do.
@batch_transform
def returns(data):
    return data.price / data.price.shift(1) - 1
And then pass that to the OLS transform. Or do this computation inside of the OLS transform itself.
HTH,
Thomas

How to define atom for Pytables EArray creation

Trying to create a PyTables EArray on the fly based on one column from a numpy recarray. This seems to work if I use createArray, as I can simply pass it the numpy array extracted from the recarray. However, for createEArray I need to define the atom, which is causing problems.
In the example, MyRecArray is a record array with 1-D arrays for columns, Myhdf5 is a predefined PyTables file, and Mynode is a predefined group in that file from which the EArray leaves will hang.
Myfield = MyRecArray[Colname]
afieldtype = Myfield.dtype
Myatom = tables.atom.Atom(afieldtype, (1,), -9999)
MyEarray = Myhdf5.createEArray(Mynode, Colname, Myatom, (0,))
MyEarray.append(Myfield)
MyEarray.flush()
MyEarray.close()
Using this code gives the error:
NotImplementedError: ``Atom`` is an abstract class;
please use one of its subclasses
I could probably write a subroutine with case statements based on the array type and pass back an atom, but I was just wondering if there is a generic way to create such an atom by passing it the array type to be created, instead of having to call a specific function for each data type, such as tables.atom.FloatAtom(....)
Thanks
I believe using the function:
tables.Atom.from_dtype(afieldtype, dflt=-9999)
will allow you to create an atom without going the subroutine route. The shape is contained in the dtype afieldtype (e.g. dtype([('col1', '<f8', (10,))])).
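A minimal, self-contained sketch of that approach, assuming a hypothetical float64 column named 'col1' and using the newer snake_case PyTables API (the camelCase calls from the question, such as createEArray, are the PyTables 2.x equivalents):
import numpy as np
import tables

# Hypothetical record array standing in for MyRecArray from the question
MyRecArray = np.zeros(5, dtype=[('col1', '<f8')])
Myfield = MyRecArray['col1']

# Build the atom from the column's dtype instead of the abstract Atom class
Myatom = tables.Atom.from_dtype(Myfield.dtype, dflt=-9999)

Myhdf5 = tables.open_file('example.h5', mode='w')
Mynode = Myhdf5.create_group('/', 'mygroup')
MyEarray = Myhdf5.create_earray(Mynode, 'col1', Myatom, (0,))
MyEarray.append(Myfield)
Myhdf5.close()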
