I wanted to add a unique id to my DataFrames, and I essentially succeeded by using what I found here, Python Class Decorator. I know from here https://github.com/pydata/pandas/issues/2485 that adding custom metadata is not yet explicitly supported, but decorators seemed like a workaround.
My decorated DataFrames return new and similarly decorated DataFrames when I use methods such as copy and groupby.agg. How can I have "all" pandas functions like pd.DataFrame() or pd.read_csv return my decorated DataFrames instead of original, undecorated DataFrames without decorating each pandas function individually? I.e., how can I have my decorated DataFrames replace the stock DataFrames?
Here's my code. First, I have an enhanced pandas module, wrapPandas.py.
import pandas
from pandas import *
import numpy as np

def addId(cls):
    class withId(cls):
        def __init__(self, *args, **kargs):
            super(withId, self).__init__(*args, **kargs)
            self._myId = np.random.randint(0, 99999)
    return withId

pandas.core.frame.DataFrame = addId(pandas.core.frame.DataFrame)
Running the following snippet of code shows my DataFrame returning decorated DataFrames when I use methods such as .copy() and .groupby().agg(). I will then follow this up by showing that pandas functions such as pd.DataFrame don't return my decorated DataFrames (sadly though not surprisingly).
EDIT: added import statement per Jonathan Eunice's response.
import wrapPandas as pd
d = {
    'strCol': ['A', 'B', 'A', 'C', 'B', 'B', 'A', 'C', 'A'],
    'intCol': [6, 3, 8, 6, 7, 3, 9, 2, 6],
}
#create "decorated" DataFrame
dfFoo = pd.core.frame.DataFrame.from_records(d)
print("dfFoo._myId = {}".format(dfFoo._myId))
#new DataFrame with new ._myId
dfBat = dfFoo.copy()
print("dfBat._myId = {}".format(dfBat._myId))
#new binding for old DataFrame, keeps old ._myId
dfRat = dfFoo
print("dfRat._myId = {}".format(dfRat._myId))
#new DataFrame with new ._myId
dfBird = dfFoo.groupby('strCol').agg({'intCol': 'sum'})
print("dfBird._myId = {}".format(dfBird._myId))
#all of these new DataFrames have the same type, "withId"
print("type(dfFoo) = {}".format(type(dfFoo)))
And this yields the following results.
dfFoo._myId = 66622
dfBat._myId = 22527
dfRat._myId = 66622
dfBird._myId = 97593
type(dfFoo) = <class 'wrapPandas.withId'>
And the sad part. dfBoo._myId raises, of course, an AttributeError.
#create "stock" DataFrame
dfBoo = pd.DataFrame(d)
print(type(dfBoo))
#doesn't have a ._myId (I wish it did, though)
print(dfBoo._myId)
Modify your monkey patch to:
pd.DataFrame = pandas.core.frame.DataFrame = addId(pandas.core.frame.DataFrame)
I.e., you are "latching onto" or "monkey patching" two different names.
This need to double-assign may seem weird, given that pandas.core.frame.DataFrame is pd.DataFrame. But you are not actually modifying the DataFrame class; you are injecting a proxy class. References that go through the proxy pick up the new behavior, while references bound directly to the original class do not. Fix that by pointing every name you might want to use at the proxy.
I assume you also have an import pandas as pd somewhere in your file that's not shown, else your definition of dfBoo would fail with NameError: name 'pd' is not defined.
Monkey patching is dangerous for reasons like this. You're injecting things, and it's impossible to know whether you've "caught all the references" or "patched everything you need to." I can't promise there aren't other calls in the code that reach these structures at a level lower than this name rejiggering can affect. But for the code displayed, it works!
Update You later asked how to make this work for pd.read_csv. Well, that's yet another of the places you might need to monkey patch. In this case, amend the patch code above to:
pd.DataFrame = pandas.io.parsers.DataFrame = pandas.core.frame.DataFrame = addId(pandas.core.frame.DataFrame)
Patching the name DataFrame inside pandas.io.parsers will do the trick for read_csv. The same caveat applies: there could be (i.e., probably are) more uses you'd need to track down for full coverage.
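A quick sanity check, assuming the amended patch line above has already run in the session and pd refers to the patched pandas (the inline CSV text is just a stand-in so read_csv has something to parse):

import io

# Constructor path: should print the decorated class and a _myId
df_ctor = pd.DataFrame({'x': [1, 2, 3]})
print(type(df_ctor), df_ctor._myId)

# read_csv path: should also carry a _myId, provided this pandas version
# really builds its result through pandas.io.parsers.DataFrame
df_csv = pd.read_csv(io.StringIO("x,y\n1,a\n2,b\n"))
print(type(df_csv), df_csv._myId)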
Related
I am looking to retrieve the name of a DataFrame instance that I pass as an argument to my function, so that I can use this name while the function executes.
Example in a script:
display(df_on_step_42)
I would like to retrieve the string "df_on_step_42" to use in the execution of the display function (that display the content of the DataFrame).
As a last resort, I can pass both the DataFrame and its name as arguments:
display(df_on_step_42, "df_on_step_42")
But I would prefer to do without this second argument.
PySpark DataFrames are immutable, so in our data pipeline we cannot systematically attach a name attribute to all the new DataFrames that come from other DataFrames.
You can use the globals() dictionary to search for your variable by matching it using eval.
As #juanpa.arrivillaga mentions, this is fundamentally bad design, but if you need to, here is one way to do it, inspired by this old SO answer for Python 2:
import pandas as pd
df_on_step_42 = pd.DataFrame()
def get_var_name(var):
    for k in globals().keys():
        try:
            if eval(k) is var:
                return k
        except:
            pass
get_var_name(df_on_step_42)
'df_on_step_42'
Your display would then look like -
display(df_on_step_42, get_var_name(df_on_step_42))
Caution
This will fail for views of a variable, since they just point to the same object as the original variable. Because the original variable occurs first when iterating over the keys of the globals dictionary, the function will return the name of the original variable.
a = 123
b = a
get_var_name(b)
'a'
I finally found a solution to my problem using the inspect and re libraries.
I use the following lines, which correspond to the use of the display() function:
import inspect
import re

def display(df):
    frame = inspect.getouterframes(inspect.currentframe())[1]
    name = re.match(r"\s*display\((\S+)\)", frame.code_context[0])[1]
    print(name)
display(df_on_step_42)
The inspect module gives me the calling context of the function; within that context, the code_context attribute gives me the text of the line where the function was called, and the re module lets me isolate the name of the DataFrame passed as the parameter.
It’s not optimal but it works.
I'm doing an ML project and decided to use classes to organize my code, though I'm not sure my approach is optimal. I'd appreciate it if you could share best practices and how you would approach a similar challenge:
Let's concentrate on the preprocessing module, where I created a Preprocessor class.
This class has 3 methods for data manipulation, each taking a dataframe as input and adding a feature. The output of each method can be the input of another.
I also have a 4th, wrapper method that takes these 3 methods, chains them, and produces the final output:
def wrapper(self):
    output = self.method_1(self.df)
    output = self.method_2(output)
    output = self.method_3(output)
    return output
When I want to use the class, I create an instance with the df and just call the wrapper method on it, which feels unnatural and makes me think there is a better way of doing it.
from preprocessing import Preprocessor

instance = Preprocessor(df)
output = instance.wrapper()
Classes are great if you need to keep track of/modify internal state of an object. But they're not magical things that keep your code organized just by existing. If all you have is a preprocessing pipeline that takes some data and runs it through methods in a straight line, regular functions will often be less cumbersome.
With the context you've given I'd probably do something like this:
pipelines.py
def preprocess_data_xyz(data):
    """
    Takes a dataframe of nature XYZ and returns it after
    running it through the necessary preprocessing steps.
    """
    step_1 = func_1(data)
    step_2 = func_2(step_1)
    step_3 = func_3(step_2)
    return step_3

def func_1(data):
    """Does X to data."""
    pass

# etc ...
analysis.py
import pandas as pd
from pipelines import preprocess_data_xyz
data_xyz = pd.DataFrame( ... )
preprocessed_data_xyz = preprocess_data_xyz(data=data_xyz)
Choosing better variable and function names is also a major component of organizing your code: you should replace func_1 with a name that describes what it does to the data (something like add_numerical_column, parse_datetime_column, etc.), and likewise for the data_xyz variable.
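For instance, the pipeline body might then read like this (the step names below are invented purely for illustration):

def preprocess_data_xyz(data):
    """Run an XYZ dataframe through descriptively named preprocessing steps."""
    # Hypothetical step names; substitute whatever your steps actually do
    with_numerics = add_numerical_column(data)
    with_dates = parse_datetime_column(with_numerics)
    return drop_incomplete_rows(with_dates)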
I'm trying to speed-up some multiprocessing code in Python 3. I have a big read-only DataFrame and a function to make some calculations based on the read values.
I tried to solve the issue by writing the function in the same file and sharing the big DataFrame as a global, as you can see here. This approach does not allow me to move the process function to another file/module, and it's a bit weird to access a variable outside the scope of the function.
import pandas as pd
import multiprocessing
def process(user):
    # Locate all the user sessions in the *global* sessions dataframe
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series()
    # Make calculations and append to user_session_data
    return user_session_data
# The DataFrame users contains ID, and other info for each user
users = pd.read_csv('users.csv')
# Each row is the details of one user action.
# There are several rows with the same user ID
sessions = pd.read_csv('sessions.csv')
p = multiprocessing.Pool(4)
sessions_id = sessions['user_id'].unique()
# I'm passing an integer ID argument to the process() function so
# there is no copy of the big sessions DataFrame
result = p.map(process, sessions_id)
Things I've tried:
Pass a DataFrame instead of integer ID arguments to avoid the sessions.loc... line of code. This approach slows the script down a lot.
Also, I've looked at How to share pandas DataFrame object between processes? but didn't find a better way.
You can try defining process as:
def process(sessions, user):
    ...
And put it wherever you prefer.
Then, when you call p.map, you can use the functools.partial function, which allows you to specify arguments incrementally:
from functools import partial
...
p.map(partial(process, sessions), sessions_id)
This should not slow down the processing too much, and it should address your issue.
Note that you might try the same thing without partial, using a lambda:
p.map(lambda id: process(sessions, id), sessions_id)
Be aware, though, that the standard multiprocessing pickler cannot serialize lambdas, so partial is the more reliable choice.
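For concreteness, here is a minimal sketch of how the pieces might fit together; the module name user_processing.py is made up, and the per-user calculation is elided just as in the question:

# user_processing.py (hypothetical module name)
import pandas as pd

def process(sessions, user):
    # The sessions frame now arrives as an explicit argument
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series(dtype=float)
    # Make calculations and append to user_session_data
    return user_session_data

# main script
import multiprocessing
from functools import partial

import pandas as pd
from user_processing import process

if __name__ == '__main__':
    sessions = pd.read_csv('sessions.csv')
    sessions_id = sessions['user_id'].unique()
    with multiprocessing.Pool(4) as p:
        result = p.map(partial(process, sessions), sessions_id)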
Is it possible to modify a class outside of a class?
I frequently have to add a new column to my data frames and am looking for cleaner syntax. All my dataframes come ready for this operation.
It's essentially this operation:
DF['Percent'] = float(DF['Earned'])/DF['Total']
I'd love to add this functionality like so:
DF = DF.add_percent()
Or
DF.add_percent(inplace=True)
Right now I'm only able to do something like:
DF = add_percent(DF)
where I declare add_percent as a function outside of pandas.
You can do
DF.eval('Percent = Earned / Total')
I don't think it gets much cleaner than that.
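For example, with a small made-up frame (note that eval returns a new DataFrame by default; pass inplace=True if you want the column added to DF itself):

import pandas as pd

DF = pd.DataFrame({'Earned': [3, 5, 9], 'Total': [4, 10, 12]})

# Returns a new frame with the Percent column added
DF = DF.eval('Percent = Earned / Total')
print(DF)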
When I use R, I can use str() to inspect objects, which most of the time are a list of things.
I recently switched to Python for statistics and don't know how to inspect the objects I encounter. For example:
import statsmodels.api as sm
heart = sm.datasets.heart.load_pandas().data
heart.groupby(['censors'])['age']
I want to investigate what kind of object is heart.groupby(['censors']) that allows me to add ['age'] at the end. However, print heart.groupby(['censors']) only tells me the type of the object, not its structure and what I can do with it.
So how do I get to understand the structure of numpy / pandas object, similar to str() in R?
If you're trying to get some insight into what you can do with a Python object, you can inspect it using a beefed-up Python console like IPython. In an IPython session, first put the object you want to look at into a variable:
import statsmodels.api as sm
heart = sm.datasets.heart.load_pandas().data
h_grouped = heart.groupby(['censors'])
Then type out the variable name and double-tap Tab to bring up a list of the object's methods:
In [5]: h_grouped.<Tab><Tab>
# Shows the object's methods
A further benefit of the IPython console is that you can quickly check the help for any individual method by adding a ?:
h_grouped.apply?
# Apply function and combine results
# together in an intelligent way.
If you don't have IPython or a similar console, you can achieve something similar using dir(), e.g. dir(h_grouped), although this will also list the object's private methods, which are generally not useful and shouldn't be touched in regular use.
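For example, a rough stand-in for the Tab listing that filters out the private names:

# Public attributes and methods of the GroupBy object,
# roughly what Tab completion would show
public = [name for name in dir(h_grouped) if not name.startswith('_')]
print(public)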
type(heart.groupby(['censors'])['age'])
type will tell you what kind of object it is. At the moment you are grouping by a dimension and not telling pandas what to do with age. If you want the mean for example you could do:
heart.groupby(['censors'])['age'].mean()
This would take the mean of age by the group, and return a series.
The groupby is I think a red herring -- "age" is just a column name:
import statsmodels.api as sm
heart = sm.datasets.heart.load_pandas().data
heart
# survival censors age
# 0 15 1 54.3
# ...
heart.keys()
# Index([u'survival', u'censors', u'age'], dtype='object')