Is it possible to modify a class outside of a class?
I frequently have to add a new column to my data frames and am looking for cleaner syntax. All my dataframes come ready for this operation.
It's essentially this operation:
DF['Percent'] = DF['Earned'].astype(float) / DF['Total']
I'd love to add this functionality like so:
DF = DF.add_percent()
Or
DF.add_percent(inplace=True)
Right now I'm only able to do something like:
DF = add_percent(DF)
where I declare add_percent as a function outside of pandas.
You can do
DF.eval('Percent = Earned / Total')
I don't think it gets much cleaner than that.
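For example, with the column names from the question, either of these adds the new column and returns a DataFrame you can keep chaining (a small sketch; eval also accepts inplace=True if you prefer to modify DF in place):

import pandas as pd

DF = pd.DataFrame({'Earned': [1.0, 2.0], 'Total': [4.0, 8.0]})

# eval with an assignment expression returns a new DataFrame with the extra column
DF = DF.eval('Percent = Earned / Total')

# assign is another chainable way to add the derived column
DF = DF.assign(Percent=DF['Earned'] / DF['Total'])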
Related
I'm doing an ML project and decided to use classes to organize my code, but I'm not sure whether my approach is optimal. I'd appreciate it if you could share best practices and how you would approach a similar challenge:
Let's concentrate on the preprocessing module, where I created a Preprocessor class.
This class has 3 methods for data manipulation, each taking a dataframe as input and adding a feature. The output of each method can be the input of another.
I also have a 4th, wrapper method that chains these 3 methods and creates the final output:
def wrapper(self):
    output = self.method_1(self.df)
    output = self.method_2(output)
    output = self.method_3(output)
    return output
When I want to use the class, I create an instance with the df and just call the wrapper method on it, which feels unnatural and makes me think there is a better way of doing it.
from preprocessing import Preprocessor

instance = Preprocessor(df)
output = instance.wrapper()
Classes are great if you need to keep track of/modify internal state of an object. But they're not magical things that keep your code organized just by existing. If all you have is a preprocessing pipeline that takes some data and runs it through methods in a straight line, regular functions will often be less cumbersome.
With the context you've given I'd probably do something like this:
pipelines.py
def preprocess_data_xyz(data):
    """
    Takes a dataframe of nature XYZ and returns it after
    running it through the necessary preprocessing steps.
    """
    step_1 = func_1(data)
    step_2 = func_2(step_1)
    step_3 = func_3(step_2)
    return step_3


def func_1(data):
    """Does X to data."""
    pass

# etc ...
analysis.py
import pandas as pd
from pipelines import preprocess_data_xyz
data_xyz = pd.DataFrame( ... )
preprocessed_data_xyz = preprocess_data_xyz(data=data_xyz)
Choosing better variable and function names is also a major component of organizing your code - you should replace func_1 with a name that describes what it does to the data (something like add_numerical_column, parse_datetime_column, etc.). Likewise for the data_xyz variable.
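If you also want the pipeline itself to read like a chain, DataFrame.pipe can string plain functions together; a minimal sketch with hypothetical step functions (the names are only illustrative):

import pandas as pd

def add_numerical_column(data):
    # Hypothetical step: add a derived numeric column
    return data.assign(total=data.sum(axis=1, numeric_only=True))

def drop_missing_rows(data):
    # Hypothetical step: remove rows with missing values
    return data.dropna()

def preprocess_data_xyz(data):
    # pipe feeds the result of each step into the next function
    return (data
            .pipe(add_numerical_column)
            .pipe(drop_missing_rows))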
I'm looking for a way to count how many times my database function runs.
My code looks like this
df = pd.read_csv("nodos_results3.csv")
df['URL_dashboard'] = df.apply(create_url, axis = 1)
df.to_csv('nodos_results4.csv', index = False)
I want to count how many times the function "create_url" runs. If I were in C++, for example, I would simply have the function take another argument to increment:
void create_url(database& i_DB, int& i_count)
{
    // stuff I want done to the database
    i_count++;
}
but I'm not sure how to do something equivalent using pandas dataframe apply.
Update: For anyone who might find this while googling in the future - I personally didn't solve this issue. I was moved to another project and didn't continue working on this. Sorry I can't be more help.
apply executes the function exactly once for each row, so the function is executed df.shape[0] times. Correction: as @juanpa.arrivillaga points out, apply calls the function twice on the first row, so the correct count is df.shape[0] + 1.
Alternatively, create a global variable (say, create_url_counter=0) and increment it in the function body:
def create_url(...):
    global create_url_counter
    ...
    create_url_counter += 1
Be aware, though, that having global variables is in general a bad idea.
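If you'd rather avoid the global, one alternative (a sketch, not from the answers above) is to wrap the function in a small counting decorator and read the count off the wrapper afterwards:

import functools

def count_calls(func):
    # Wraps func so that every call increments wrapper.calls
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

counted_create_url = count_calls(create_url)  # create_url as defined in the question
df['URL_dashboard'] = df.apply(counted_create_url, axis=1)
print(counted_create_url.calls)  # how many times apply invoked the function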
I'm repeatedly trying to get similar data from time series dataframes. For example, the monthly (annualized) standard deviation of the series:
any_df.resample('M').std().mean()*12**(1/2)
It would save typing and would probably limit errors if these methods could be assigned to a variable, so they can be re-used - I guess this would look something like
my_stdev = .resample('M').std().mean()*12**(1/2)
result = any_df.my_stdev()
Is this possible and if so is it sensible?
Thanks in advance!
Why not just make your own function?
def my_stdev(df):
    return df.resample('M').std().mean() * 12**(1/2)

result = my_stdev(any_df)
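If you want it to read more like a method call, DataFrame.pipe will chain the same function for you; a small sketch using my_stdev from above:

# Equivalent to my_stdev(any_df), but reads left to right like a method call
result = any_df.pipe(my_stdev)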
I'm trying to speed up some multiprocessing code in Python 3. I have a big read-only DataFrame and a function that makes some calculations based on the values read.
I tried to solve the issue by writing the function inside the same file and sharing the big DataFrame, as you can see here. This approach does not allow moving the process function to another file/module, and it's a bit awkward to access a variable outside the scope of the function.
import pandas as pd
import multiprocessing

def process(user):
    # Locate all the user sessions in the *global* sessions dataframe
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series()
    # Make calculations and append to user_session_data
    return user_session_data

# The DataFrame users contains the ID and other info for each user
users = pd.read_csv('users.csv')

# Each row is the details of one user action.
# There are several rows with the same user ID
sessions = pd.read_csv('sessions.csv')

p = multiprocessing.Pool(4)
sessions_id = sessions['user_id'].unique()

# I'm passing an integer ID argument to the process() function so
# there is no copy of the big sessions DataFrame
result = p.map(process, sessions_id)
Things I've tried:
Passing a DataFrame instead of integer ID arguments, to avoid the sessions.loc... line of code. This approach slowed the script down a lot.
Also, I've looked at How to share pandas DataFrame object between processes? but didn't find a better way.
You can try defining process as:
def process(sessions, user):
    ...
And put it wherever you prefer.
Then, when you call p.map, you can use the functools.partial function, which allows you to specify arguments incrementally:
from functools import partial
...
p.map(partial(process, sessions), sessions_id)
This should not slow down the processing too much and should address your issue.
Note that you could try the same without partial, using a lambda:
p.map(lambda id: process(sessions, id), sessions_id)
Keep in mind, though, that multiprocessing pickles the mapped callable and lambdas cannot be pickled, so the partial version is the one that will actually work with a Pool.
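Putting it together, a minimal end-to-end sketch of the partial-based version. The process body and the CSV file name come from the question; the module name and the __main__ guard are assumptions added so the example also runs on platforms that spawn rather than fork worker processes:

# process_user.py -- hypothetical module holding the worker function
import pandas as pd

def process(sessions, user):
    # Locate this user's sessions in the DataFrame that is passed in explicitly
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series(dtype=float)
    # ... make calculations and fill user_session_data ...
    return user_session_data


# main script
import multiprocessing
from functools import partial

import pandas as pd
from process_user import process

if __name__ == '__main__':
    sessions = pd.read_csv('sessions.csv')
    sessions_id = sessions['user_id'].unique()

    with multiprocessing.Pool(4) as p:
        result = p.map(partial(process, sessions), sessions_id)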
I wanted to add a unique id to my DataFrames, and I essentially succeeded by using what I found here, Python Class Decorator. I know from here https://github.com/pydata/pandas/issues/2485 that adding custom metadata is not yet explicitly supported, but decorators seemed like a workaround.
My decorated DataFrames return new and similarly decorated DataFrames when I use methods such as copy and groupby.agg. How can I have "all" pandas functions like pd.DataFrame() or pd.read_csv return my decorated DataFrames instead of original, undecorated DataFrames without decorating each pandas function individually? I.e., how can I have my decorated DataFrames replace the stock DataFrames?
Here's my code. First, I have an enhanced pandas module, wrapPandas.py.
import pandas  # needed so the patch line below can reach pandas.core.frame
from pandas import *
import numpy as np

def addId(cls):
    class withId(cls):
        def __init__(self, *args, **kargs):
            super(withId, self).__init__(*args, **kargs)
            self._myId = np.random.randint(0, 99999)
    return withId

pandas.core.frame.DataFrame = addId(pandas.core.frame.DataFrame)
Running the following snippet of code shows my DataFrame returning decorated DataFrames when I use methods such as .copy() and .groupby().agg(). I will then follow this up by showing that pandas functions such as pd.DataFrame don't return my decorated DataFrames (sadly though not surprisingly).
EDIT: added import statement per Jonathan Eunice's response.
import wrapPandas as pd
d = {
'strCol': ['A', 'B', 'A', 'C', 'B', 'B', 'A', 'C', 'A'],
'intCol': [6,3,8,6,7,3,9,2,6],
}
#create "decorated" DataFrame
dfFoo = pd.core.frame.DataFrame.from_records(d)
print("dfFoo._myId = {}".format(dfFoo._myId))
#new DataFrame with new ._myId
dfBat = dfFoo.copy()
print("dfBat._myId = {}".format(dfBat._myId))
#new binding for old DataFrame, keeps old ._myId
dfRat = dfFoo
print("dfRat._myId = {}".format(dfRat._myId))
#new DataFrame with new ._myId
dfBird = dfFoo.groupby('strCol').agg({'intCol': 'sum'})
print("dfBird._myId = {}".format(dfBird._myId))
#all of these new DataFrames have the same type, "withId"
print("type(dfFoo) = {}".format(type(dfFoo)))
And this yields the following results.
dfFoo._myId = 66622
dfBat._myId = 22527
dfRat._myId = 66622
dfBird._myId = 97593
type(dfFoo) = <class 'wrapPandas.withId'>
And the sad part. dfBoo._myId raises, of course, an AttributeError.
#create "stock" DataFrame
dfBoo = pd.DataFrame(d)
print(type(dfBoo))
#doesn't have a ._myId (I wish it did, though)
print(dfBoo._myId)
Modify your monkey patch to:
pd.DataFrame = pandas.core.frame.DataFrame = addId(pandas.core.frame.DataFrame)
I.e. so you are "latching on" or "monkey patching" two different names.
This need to double-assign may seem weird, given that pandas.core.frame.DataFrame is pd.DataFrame. But you are not actually modifying the DataFrame class; you are injecting a proxy class. References that go through the rebound name get the proxy behavior, while references bound directly to the original class do not. Fix that by pointing every name you might want to use at the proxy.
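To see why the double assignment matters, here is a tiny pandas-free sketch (all names invented for illustration):

class Original:
    pass

alias = Original                     # a second reference to the class, like pd.DataFrame

class Proxy(Original):
    tagged = True

Original = Proxy                     # rebind only one of the names

print(Original().tagged)             # True  -- this name now points at the proxy
print(hasattr(alias(), 'tagged'))    # False -- the old reference still builds the original class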
I assume you also have an import pandas as pd somewhere in your file that's not shown, else your definition of dfBoo would fail with NameError: name 'pd' is not defined.
Monkey patching is dangerous for reasons like this. You're injecting things, and it's impossible to know whether you've "caught all the references" or "patched everything you need to." I can't promise there aren't other calls in the code that reach these structures at a lower level than this name rejiggering covers. But for the code displayed, it works!
Update: You later asked how to make this work for pd.read_csv. Well, that's yet another of the places you might need to monkey patch. In this case, amend the patch code above to:
pd.DataFrame = pandas.io.parsers.DataFrame = pandas.core.frame.DataFrame = addId(pandas.core.frame.DataFrame)
Patching the DataFrame name inside pandas.io.parsers will do the trick for read_csv. The same caveat applies: there could be (i.e. probably are) more uses you'd need to track down for full coverage.
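Putting the pieces together, the amended wrapPandas.py would look roughly like this. It is a sketch of the approach above rather than a guaranteed recipe: the pandas.io.parsers location is version-dependent, and the extra module-level DataFrame rebinding is an assumption added so that pd.DataFrame also resolves to the proxy when the module is imported as pd:

# wrapPandas.py (sketch)
import pandas
from pandas import *
import numpy as np

def addId(cls):
    class withId(cls):
        def __init__(self, *args, **kargs):
            super(withId, self).__init__(*args, **kargs)
            self._myId = np.random.randint(0, 99999)
    return withId

# Point every name that might be used to construct a DataFrame at the proxy class
DataFrame = pandas.DataFrame = pandas.io.parsers.DataFrame = \
    pandas.core.frame.DataFrame = addId(pandas.core.frame.DataFrame)

# elsewhere:
# import wrapPandas as pd
# dfBoo = pd.DataFrame({'intCol': [1, 2, 3]})
# print(dfBoo._myId)                       # now present
# dfCsv = pd.read_csv('some_file.csv')     # hypothetical file; read_csv now builds the decorated class
# print(dfCsv._myId)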