I'm trying to speed up some multiprocessing code in Python 3. I have a big read-only DataFrame and a function that makes some calculations based on the values it reads.
I tried to solve the issue by writing the function inside the same file and sharing the big DataFrame as a global, as you can see here. This approach does not allow moving the process function to another file/module, and it's a bit weird to access a variable outside the scope of the function.
import pandas as pd
import multiprocessing
def process(user):
    # Locate all the user sessions in the *global* sessions dataframe
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series()
    # Make calculations and append to user_session_data
    return user_session_data
# The DataFrame users contains ID, and other info for each user
users = pd.read_csv('users.csv')
# Each row is the details of one user action.
# There are several rows with the same user ID
sessions = pd.read_csv('sessions.csv')
p = multiprocessing.Pool(4)
sessions_id = sessions['user_id'].unique()
# I'm passing an integer ID argument to process() function so
# there is no copy of the big sessions DataFrame
result = p.map(process, sessions_id)
Things I've tried:
Passing a DataFrame instead of integer ID arguments, to avoid the sessions.loc... line of code. This approach slows the script down a lot.
Also, I've looked at How to share pandas DataFrame object between processes? but didn't find a better way.
You can try defining process as:
def process(sessions, user):
    ...
And put it wherever you prefer.
Then, when you call p.map, you can use the functools.partial function, which lets you supply some of the arguments ahead of time:
from functools import partial
...
p.map(partial(process, sessions), sessions_id)
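This should not slow the processing down too much, and it addresses your concerns.
Note that a lambda is not a drop-in replacement for partial here: multiprocessing pickles the function it sends to the workers, and lambdas cannot be pickled. With the built-in map (no pool) the equivalent would be:
list(map(lambda user: process(sessions, user), sessions_id))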
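To see what partial does without starting a pool, here is a minimal sketch. The dict standing in for the sessions DataFrame and the body of process are made up for illustration.

```python
from functools import partial

# Hypothetical stand-in for the sessions DataFrame: user_id -> list of actions.
sessions = {1: ["login", "view"], 2: ["login"], 3: []}


def process(sessions, user):
    """Return (user, number of recorded actions) for one user."""
    return user, len(sessions[user])


# partial "freezes" the first positional argument, leaving a one-argument callable.
process_user = partial(process, sessions)

results = list(map(process_user, sessions.keys()))
print(results)
```

With a pool, the same callable would be passed as p.map(process_user, sessions_id). A partial of a module-level function can be pickled as long as its frozen arguments can, which is what Pool.map needs.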
I am looking to retrieve the name of an instance of DataFrame, that I pass as an argument to my function, to be able to use this name in the execution of the function.
Example in a script:
display(df_on_step_42)
I would like to retrieve the string "df_on_step_42" to use in the execution of the display function (which displays the content of the DataFrame).
As a last resort, I can pass both the DataFrame and its name as arguments:
display(df_on_step_42, "df_on_step_42")
But I would prefer to do without this second argument.
PySpark DataFrames are immutable, so in our data pipeline we cannot systematically attach a name attribute to all the new DataFrames that are derived from other DataFrames.
You can use the globals() dictionary to search for your variable by matching it using eval.
As @juanpa.arrivillaga mentions, this is fundamentally bad design, but if you need to, here is one way to do this, inspired by this old SO answer for Python 2 -
import pandas as pd
df_on_step_42 = pd.DataFrame()
def get_var_name(var):
    # Return the first global name that is bound to the same object as var
    for k in globals().keys():
        try:
            if eval(k) is var:
                return k
        except Exception:
            pass
get_var_name(df_on_step_42)
'df_on_step_42'
Your display would then look like -
display(df_on_step_42, get_var_name(df_on_step_42))
Caution
This will fail for aliases, since a second name simply points to the same object as the original variable. If the original variable occurs first in the globals dictionary while iterating over its keys, the function returns the name of the original variable.
a = 123
b = a
get_var_name(b)
'a'
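The aliasing caveat can be made explicit by returning every matching name instead of the first one. This is a minimal sketch: get_var_names is a hypothetical helper, and it takes the namespace as an argument (pass globals() to mirror the function above).

```python
def get_var_names(var, namespace):
    """Return every name in `namespace` bound to the very same object as `var`."""
    return [name for name, value in namespace.items() if value is var]


a = [1, 2, 3]
b = a          # second name for the same list object
c = [1, 2, 3]  # equal contents, but a different object

names = get_var_names(b, {"a": a, "b": b, "c": c})
print(names)  # ['a', 'b']
```

Both a and b are reported because the identity check cannot tell which name the caller actually used.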
I finally found a solution to my problem using the inspect and re libraries.
I use the following lines which correspond to the use of the display() function
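import inspect
import re

def display(df):
    # Frame record of the caller; code_context holds the source line of the call
    frame = inspect.getouterframes(inspect.currentframe())[1]
    # Capture the argument text between the parentheses of display(...)
    name = re.match(r"\s*display\((\w+)\)", frame.code_context[0])[1]
    print(name)

display(df_on_step_42)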
The inspect library allows me to get the call context of the function, in this context, the code_context attribute gives me the text of the line where the function is called, and finally the regex library allows me to isolate the name of the dataframe given as parameter.
It’s not optimal but it works.
I'm fairly new to Dask and wondering why it is behaving in such a strange way. Essentially, I create a new column with random UUIDs and join it to another Dask DataFrame. For some odd reason the UUIDs keep changing, and I'm not sure if I am missing something.
This is a representation of my code:
from uuid import uuid4
import dask.dataframe as dd

def generate_uuid(_row=None) -> str:
    """Generate a uuid4 string; the row passed in by apply() is ignored."""
    return str(uuid4())

my_dask_data = dd.from_pandas(my_pandas_data, npartitions=4)
my_dask_data["uuid"] = None
my_dask_data["uuid"] = my_dask_data.apply(generate_uuid, axis=1, meta=("uuid", "str"))
print(my_dask_data.compute())
And this is the output:
name uuid
my_name_1 16fb858c-bbed-413b-a415-62099ee2c455
my_name_2 9acd0a22-9b19-4db6-9759-b70dc0353710
my_name_3 5d610aaf-a813-4d0b-8d83-8f11fe400c7e
Then I concatenate it with another Dask DataFrame:
joined_data = dd.concat([my_dask_data, my_other_dask_data], axis=1)
print(joined_data.compute())
This is the output, which for some reason contains new UUIDs:
name uuid tests
my_name_1 f951cefa-1145-411c-96f6-924730d7cb22 test1
my_name_2 88e28e5f-42ea-4fbe-a036-b8179a0ba3f8 test2
my_name_3 50e70fac-da19-4d2f-b6ea-80da41591ac5 test3
Any thoughts on how to keep the same uuids without changing?
Dask does not keep your data in memory, by design - this is a huge attractive feature of dask. So every time you compute, your function will be executed again. Since uuid4() is based on a random number generator, different results each time are expected. In fact, UUIDs are never supposed to repeat.
The question is, what would you like to happen, what is your actual workflow? You might be interested in reading this SO question: How to generate a random UUID which is reproducible (with a seed) in Python
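If the IDs only need to be stable rather than random, one option is to derive them from the row's data with uuid5, which is deterministic. This is a minimal sketch that assumes the name column identifies a row uniquely; stable_uuid and the namespace value are made up for illustration.

```python
import uuid

# Assumption: any fixed UUID works as the namespace; this one is arbitrary.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def stable_uuid(name: str) -> str:
    """Derive a UUID from `name`; the same name always maps to the same UUID."""
    return str(uuid.uuid5(NAMESPACE, name))


first = stable_uuid("my_name_1")
second = stable_uuid("my_name_1")
other = stable_uuid("my_name_2")
print(first == second, first == other)  # True False
```

In the Dask code this could be applied to the name column, for example my_dask_data["name"].apply(stable_uuid, meta=("uuid", "str")), so that recomputing the graph yields the same values. If truly random IDs are required, another route is to add the uuid column to the pandas DataFrame before calling dd.from_pandas, so the values are fixed before Dask builds its graph.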
I'm doing an ML project and decided to use classes to organize my code. However, I'm not sure if my approach is optimal. I'd appreciate it if you could share best practices and how you would approach a similar challenge:
Let's concentrate on the preprocessing module, where I created a Preprocessor class.
This class has 3 methods for data manipulation, each taking a DataFrame as input and adding a feature. The output of each method can be the input of another.
I also have a 4th, wrapper method that takes these 3 methods, chains them and creates the final output:
def wrapper(self):
    output = self.method_1(self.df)
    output = self.method_2(output)
    output = self.method_3(output)
    return output
When I want to use the class, I create an instance with df and just call the wrapper method on it. This feels unnatural and makes me think there is a better way of doing it.
import A_class
instance = A_class(df)
output = instance.wrapper()
Classes are great if you need to keep track of/modify internal state of an object. But they're not magical things that keep your code organized just by existing. If all you have is a preprocessing pipeline that takes some data and runs it through methods in a straight line, regular functions will often be less cumbersome.
With the context you've given I'd probably do something like this:
pipelines.py
def preprocess_data_xyz(data):
    """
    Takes a dataframe of nature XYZ and returns it after
    running it through the necessary preprocessing steps.
    """
    step_1 = func_1(data)
    step_2 = func_2(step_1)
    step_3 = func_3(step_2)
    return step_3

def func_1(data):
    """Does X to data."""
    pass

# etc ...
analysis.py
import pandas as pd
from pipelines import preprocess_data_xyz
data_xyz = pd.DataFrame( ... )
preprocessed_data_xyz = preprocess_data_xyz(data=data_xyz)
Choosing better variable and function names is also a major component of organizing your code. You should replace func_1 with a name that describes what it does to the data (something like add_numerical_column, parse_datetime_column, etc.). The same goes for the data_xyz variable.
To provide a bit of context, I am building a risk model that pulls data from various sources. Initially I wrote the model as a single function that, when executed, read in the different data sources as pandas.DataFrame objects and used those objects when necessary. As the model grew in complexity, it quickly became unreadable and I found myself copying and pasting blocks of code often.
To cleanup the code I decided to make a class that when initialized reads, cleans and parses the data. Initialization takes about a minute to run and builds my model in its entirety.
The class also has some additional functionality. There is a generate_email method that sends an email with details about high risk factors and another method append_history that point-in-times the risk model and saves it so I can run time comparisons.
The thing about these two additional methods is that I cannot imagine a scenario where I would call them without first re-calibrating my risk model. So I have considered calling them in __init__() like my other methods. I haven't, only because I am trying to justify having a class in the first place.
I am consulting this community because my project structure feels clunky and awkward. I am inclined to believe that I should not be using a class at all. Is it frowned upon to create classes merely for the purpose of organization? Also, is it bad practice to call instance methods (that take upwards of a minute to run) within __init__()?
Ultimately, I am looking for reassurance or a better code structure. Any help would be greatly appreciated.
Here is some pseudo code showing my project structure:
import pandas as pd

class RiskModel:
    def __init__(self, data_path_a, data_path_b):
        self.data_path_a = data_path_a
        self.data_path_b = data_path_b
        self.historical_data = None
        self.raw_data = None
        self.lookup_table = None
        self._read_in_data()
        self.risk_breakdown = None
        self._generate_risk_breakdown()
        self.risk_summary = None
        self._generate_risk_summary()

    def _read_in_data(self):
        # read in a .csv
        self.historical_data = pd.read_csv(self.data_path_a)
        # read an excel file containing many sheets into an ordered dictionary
        self.raw_data = pd.read_excel(self.data_path_b, sheet_name=None)
        # store a specific sheet from the excel file that is used by most of
        # my class's methods
        self.lookup_table = self.raw_data["Lookup"]

    def _generate_risk_breakdown(self):
        '''
        A function that creates a DataFrame from self.historical_data,
        self.raw_data, and self.lookup_table and stores it in
        self.risk_breakdown
        '''
        self.risk_breakdown = some_dataframe  # placeholder (pseudo code)

    def _generate_risk_summary(self):
        '''
        A function that creates a DataFrame from self.lookup_table and
        self.risk_breakdown and stores it in self.risk_summary
        '''
        self.risk_summary = some_dataframe  # placeholder (pseudo code)

    def generate_email(self, recipient):
        '''
        A function that sends an email with details about high risk factors
        '''

if __name__ == "__main__":
    risk_model = RiskModel(data_path_a, data_path_b)
    risk_model.generate_email("recipient@generic.com")
In my opinion it is a good way to organize your project, especially since you mentioned the high rate of re-usability of parts of the code.
One thing though: I wouldn't call the _read_in_data, _generate_risk_breakdown and _generate_risk_summary methods inside __init__, but would instead let the user call these methods after initializing the RiskModel instance.
This way the user would be able to read in data from a different path or only to generate the risk breakdown or summary, without reading in the data once again.
Something like this:
my_risk_model = RiskModel()
my_risk_model.read_in_data(path_a, path_b)
my_risk_model.generate_risk_breakdown(parameters)
my_risk_model.generate_risk_summary(other_parameters)
If there is a risk of the user calling these methods in an order that would break the logical chain, you could raise an exception if generate_risk_breakdown or generate_risk_summary is called before read_in_data. Of course, you could also move only the generate... methods out, leaving the data import inside __init__.
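The call-order check could look roughly like this. It is a minimal sketch: the list of numbers standing in for the data, the threshold parameter and the choice of RuntimeError are all assumptions made for illustration.

```python
class RiskModel:
    """Hypothetical skeleton: only the call-order check is shown."""

    def __init__(self):
        self.raw_data = None
        self.risk_summary = None

    def read_in_data(self, rows):
        # Stand-in for the real file reading; `rows` is made-up input.
        self.raw_data = list(rows)

    def generate_risk_summary(self, threshold):
        # Refuse to run until the data has been loaded.
        if self.raw_data is None:
            raise RuntimeError("call read_in_data() before generate_risk_summary()")
        self.risk_summary = [r for r in self.raw_data if r >= threshold]
        return self.risk_summary


model = RiskModel()
try:
    model.generate_risk_summary(50)
except RuntimeError as exc:
    error_message = str(exc)

model.read_in_data([40, 55, 90])
summary = model.generate_risk_summary(50)
print(error_message)
print(summary)  # [55, 90]
```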
To argue further for moving the generate... methods out of __init__, consider a scenario where you would like to generate multiple risk summaries while changing various parameters. It would make sense not to create the RiskModel every time and read the same data again, but instead to change the input to the generate_risk_summary method:
my_risk_model = RiskModel()
my_risk_model.read_in_data(path_a, path_b)
for parameter in [50, 60, 80]:
    my_risk_model.generate_risk_summary(parameter)
    my_risk_model.generate_email('test@gmail.com')
Suppose I have a list of reference numbers in a file numbers.csv.
I am writing a module checker.py that can be imported and called to check for numbers like this:
import checker
checker.check_number(1234)
checker.py will load a list of numbers and provide check_number to check if a given number is in the list. By default it should load the data from numbers.csv, although the path can be specified:
DATA_FILEPATH = 'numbers.csv'
def load_data(data_filepath=DATA_FILEPATH):
    ...
REF_LIST = load_data()
def check_number(num, ref_list=REF_LIST):
    ...
Question -- Interleaving variables and functions seems strange to me. Is there a better way to structure checker.py than the above?
I read the excellent answer to How to create module-wide variables in Python?.
Is the best practice to:
declare REF_LIST as I have done above?
create a dict like VARS = {'REF_LIST': None} and set VARS['REF_LIST'] in load_data?
create a trivial class and assign clas.REF_LIST in load_data?
or else, is it dependent on the situation? (And in which situations do I use which?)
Note
Previously, I avoided this by loading the data only when needed in the calling module. So in checker.py:
DATA_FILEPATH = 'numbers.csv'
def load_data(data_filepath=DATA_FILEPATH):
    ...
def check_number(num, ref_list):
    ...
In the calling module:
import checker
ref_list = checker.load_data()
checker.check_number(1234, ref_list)
But it didn't quite make sense to me to load the data in the calling module, because I would need to call load_data 5 times if I wanted to check numbers in 5 different modules.
You can load CSV data easily with the help of the pandas library:
import pandas as pd
dataframe=pd.read_csv('numbers.csv')
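To check whether a number is present in the DataFrame, use this code:
numbers = [1, 3, 8]
# Compare against the column's values; `in` on a Series itself tests the index labels
column_values = dataframe[dataframe.columns[0]].values
for number in numbers:
    if number in column_values:
        print(True)
    else:
        print(False)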
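For the module-structure part of the question, one option not raised in the thread is to load the data lazily and cache the result, so no module-level variable sits between the functions. This is a minimal sketch using only the standard library. It assumes a header-less CSV with integers in the first column, and the two-argument check_number signature is made up for illustration.

```python
import csv
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=None)
def load_data(data_filepath):
    """Read the first column of the CSV into a frozenset, once per path."""
    with open(data_filepath, newline="") as handle:
        return frozenset(int(row[0]) for row in csv.reader(handle) if row)


def check_number(num, data_filepath):
    """Return True if `num` is listed in the file at `data_filepath`."""
    return num in load_data(data_filepath)


# Demo with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numbers.csv")
    with open(path, "w", newline="") as handle:
        handle.write("1234\n5678\n")
    found = check_number(1234, path)
    missing = check_number(42, path)
    loads = load_data.cache_info().misses  # how many times the file was read

print(found, missing, loads)  # True False 1
```

Python also caches imported modules, so a module-level REF_LIST = load_data() in checker.py runs once per process no matter how many modules import checker.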