Dask's strange behaviour when adding a new column with random uuids - python

Fairly new to Dask, but I'm just wondering why it is behaving in such a strange way. Essentially, I create a new column with random uuids and join it to another Dask dataframe. For some odd reason the uuids keep changing, and I'm not sure if I am missing something.
This is a representation of my code:
from uuid import uuid4
import dask.dataframe as dd

def generate_uuid() -> str:
    """Generates a uuid4 id."""
    return str(uuid4())

my_dask_data = dd.from_pandas(my_pandas_data, npartitions=4)
my_dask_data["uuid"] = None
my_dask_data["uuid"] = my_dask_data.apply(lambda _row: generate_uuid(), axis=1, meta=("uuid", "str"))
print(my_dask_data.compute())
And this is the output:
name uuid
my_name_1 16fb858c-bbed-413b-a415-62099ee2c455
my_name_2 9acd0a22-9b19-4db6-9759-b70dc0353710
my_name_3 5d610aaf-a813-4d0b-8d83-8f11fe400c7e
Then, I do a concat with another Dask dataframe:
joined_data = dd.concat([my_dask_data, my_other_dask_data], axis=1)
print(joined_data.compute())
This is the output, which for some reason contains new uuids:
name uuid tests
my_name_1 f951cefa-1145-411c-96f6-924730d7cb22 test1
my_name_2 88e28e5f-42ea-4fbe-a036-b8179a0ba3f8 test2
my_name_3 50e70fac-da19-4d2f-b6ea-80da41591ac5 test3
Any thoughts on how to keep the same uuids without changing?

Dask does not keep your data in memory, by design - this is a huge attractive feature of Dask. So every time you compute, your function is executed again. Since uuid4() is based on a random number generator, different results each time are expected; in fact, UUIDs are never supposed to repeat.
The question is, what would you like to happen - what is your actual workflow? You might be interested in reading this SO question: How to generate a random UUID which is reproducible (with a seed) in Python.
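If the goal is simply to generate the UUIDs once and have every later computation reuse them, one option is to materialize the column before concatenating, for example with persist(). This is a minimal sketch, not from the original answer, and it assumes the data fits in memory on the local scheduler:
import dask.dataframe as dd

# Materialize the partitions once; later graphs reuse the stored values
# (including the uuid column) instead of re-running generate_uuid.
my_dask_data = my_dask_data.persist()

joined_data = dd.concat([my_dask_data, my_other_dask_data], axis=1)
print(joined_data.compute())  # the uuid column now matches the earlier output
Computing to a pandas DataFrame and converting back with dd.from_pandas would have the same effect.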

Related

Avoiding Unnecessary Class Declarations

I'm doing an ML project and decided to use classes to organize my code. However, I'm not sure if my approach is optimal. I'd appreciate it if you could share best practices and how you would approach a similar challenge:
Let's concentrate on the preprocessing module, where I created a Preprocessor class.
This class has 3 methods for data manipulation, each taking a dataframe as input and adding a feature. The output of each method can be the input of another.
I also have a 4th, wrapper method that takes these 3 methods, chains them and creates the final output:
def wrapper(self):
    output = self.method_1(self.df)
    output = self.method_2(output)
    output = self.method_3(output)
    return output
When I want to use the class, I create an instance with the df and just call the wrapper method on it, which feels unnatural and makes me think there is a better way of doing it.
from A_class import A_class

instance = A_class(df)
output = instance.wrapper()
Classes are great if you need to keep track of/modify internal state of an object. But they're not magical things that keep your code organized just by existing. If all you have is a preprocessing pipeline that takes some data and runs it through methods in a straight line, regular functions will often be less cumbersome.
With the context you've given I'd probably do something like this:
pipelines.py
def preprocess_data_xyz(data):
    """
    Takes a dataframe of nature XYZ and returns it after
    running it through the necessary preprocessing steps.
    """
    step_1 = func_1(data)
    step_2 = func_2(step_1)
    step_3 = func_3(step_2)
    return step_3

def func_1(data):
    """Does X to data."""
    pass
# etc ...
analysis.py
import pandas as pd
from pipelines import preprocess_data_xyz
data_xyz = pd.DataFrame( ... )
preprocessed_data_xyz = preprocess_data_xyz(data=data_xyz)
Choosing better variable and function names is also a major part of organizing your code - you should replace func_1 with a name that describes what it does to the data (something like add_numerical_column, parse_datetime_column, etc.). Likewise for the data_xyz variable.
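To make the naming point concrete, here is a small sketch with hypothetical step names and columns (quantity, unit_price, order_date are made up for illustration); the pipeline function then reads like documentation of what happens to the data:
import pandas as pd

def add_order_total_column(data):
    """Hypothetical step: derive an order total from quantity and unit price."""
    data = data.copy()
    data["order_total"] = data["quantity"] * data["unit_price"]
    return data

def parse_order_date_column(data):
    """Hypothetical step: parse order date strings into datetimes."""
    data = data.copy()
    data["order_date"] = pd.to_datetime(data["order_date"])
    return data

def preprocess_sales_data(data):
    """Chains the named steps in order."""
    return parse_order_date_column(add_order_total_column(data))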

Count iterations of pandas dataframe apply function

I'm looking for a way to count how many times my database function runs.
My code looks like this
df = pd.read_csv("nodos_results3.csv")
df['URL_dashboard'] = df.apply(create_url, axis = 1)
df.to_csv('nodos_results4.csv', index = False)
I want to count how many times the function "create_url" runs. If I were in C++, for example, I would simply have the function take another input:
create_url(database i_DB, int i_count)
{
    // stuff I want done to database
    i_count++;
}
but I'm not sure how to do something equivalent using pandas dataframe apply.
Update: For anyone who might find this while googling in the future - I personally didn't solve this issue. I was moved to another project and didn't continue working on this. Sorry I can't be more help.
apply executes the function exactly once for each row, so the function is executed df.shape[0] times. Correction: as per @juanpa.arrivillaga, apply executes the function twice on the first row, so the correct answer is df.shape[0] + 1.
Alternatively, create a global variable (say, create_url_counter=0) and increment it in the function body:
def create_url(...):
    global create_url_counter
    ...
    create_url_counter += 1
Be aware, though, that having global variables is in general a bad idea.
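For reference, a minimal runnable sketch of the global-counter idea; the create_url body and the node_id column are placeholders, not from the question:
import pandas as pd

create_url_counter = 0

def create_url(row):
    """Builds a dashboard URL for one row and increments the call counter."""
    global create_url_counter
    create_url_counter += 1
    return f"https://example.com/dashboard/{row['node_id']}"  # hypothetical URL scheme

df = pd.DataFrame({"node_id": [1, 2, 3]})
df["URL_dashboard"] = df.apply(create_url, axis=1)
print(create_url_counter)  # 3 here; older pandas versions may report one extra call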

Possible and/or wise to assign a method to a variable?

I'm repeatedly trying to get similar data from time series dataframes. For example, the monthly (annualized) standard deviation of the series:
any_df.resample('M').std().mean()*12**(1/2)
It would save typing and would probably limit errors if these methods could be assigned to a variable, so they can be re-used - I guess this would look something like
my_stdev = .resample('M').std().mean()*12**(1/2)
result = any_df.my_stdev()
Is this possible and if so is it sensible?
Thanks in advance!
Why not just make your own function?
def my_stdev(df):
    return df.resample('M').std().mean() * 12**(1/2)
result = my_stdev(any_df)
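A quick usage check on a synthetic daily series (the column name and data are made up for illustration; resample needs a DatetimeIndex):
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=365, freq="D")
any_df = pd.DataFrame({"ret": np.random.normal(0, 0.01, len(idx))}, index=idx)

print(my_stdev(any_df))  # monthly standard deviation of 'ret', annualized by sqrt(12)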

Unexpected Behavior with Pandas Datetime / dt.date

I found a bug in my code that was due to the date column of my dataframe not including the hours and minutes, only the date. I traced the issue to running these two functions consecutively rather than one at a time. If I run the functions one by one, there is no problem; if I run them both, my results are unexpected.
I need to run these functions consecutively, but they are not dependent on one another. I'm new to Python, so I thought this might be due to the inputs being overwritten or something (not that that would have happened in Java, as far as I know). So, I changed the functions to be as follows:
def func1(dataset):
    originalData = dataset
    # only look at one day at a time - remove extra unnecessary info
    originalData['Date'] = pd.to_datetime(originalData['Date'])
    print(dataset, 'test1')
    originalData['Date'] = originalData['Date'].dt.date
    print(dataset, 'test2')
    # other stuff

def func2(dataset):
    originalData2 = dataset
    # look at the entire datetime
    originalData2['Date'] = pd.to_datetime(originalData2['Date'])
    print(originalData2)
    # other stuff
Run like this, I lose the time in the second function.
csv = pd.read_csv(csvFileName)
func1(csv)
func2(csv)
Run like this, func2 results in my desired output:
csv = pd.read_csv(csvFileName)
func2(csv)
The weird thing is that if I run func1, test1 prints the date with the time, while test2 prints only the date. The dataset is being changed even though the changes are applied to originalData. Am I misunderstanding something? Thanks in advance.
If you don't want to make changes to the underlying data, I'd recommend copying it inside the function like this: originalData = dataset.copy(). This makes a deep copy, meaning you'll only be editing the data within the function and not modifying the underlying object.
Odd behavior, yes.
You may also run into this when taking slices of dataframes and doing transformations on them.
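Here is a sketch of how the suggested .copy() could slot into func1 (same names as the question, pandas imported as pd; returning the copy is my own addition, since the question's version mutates in place):
def func1(dataset):
    originalData = dataset.copy()  # work on a copy so the caller's dataframe is untouched
    originalData['Date'] = pd.to_datetime(originalData['Date'])
    originalData['Date'] = originalData['Date'].dt.date
    return originalData  # hand back the modified copy
With this change, func2 still receives the original column and pd.to_datetime keeps the hours and minutes.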

Access Shared DataFrame in Multiprocessing Map

I'm trying to speed up some multiprocessing code in Python 3. I have a big read-only DataFrame and a function to make some calculations based on the read values.
I tried to solve the issue by writing a function inside the same file and sharing the big DataFrame, as you can see here. This approach does not allow moving the process function to another file/module, and it's a bit weird to access a variable outside the scope of the function.
import pandas as pd
import multiprocessing
def process(user):
    # Locate all the user sessions in the *global* sessions dataframe
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series()
    # Make calculations and append to user_session_data
    return user_session_data
# The DataFrame users contains ID, and other info for each user
users = pd.read_csv('users.csv')
# Each row is the details of one user action.
# There are several rows with the same user ID
sessions = pd.read_csv('sessions.csv')
p = multiprocessing.Pool(4)
sessions_id = sessions['user_id'].unique()
# I'm passing an integer ID argument to process() function so
# there is no copy of the big sessions DataFrame
result = p.map(process, sessions_id)
Things I've tried:
Pass a DataFrame instead of integer ID arguments to avoid the sessions.loc... line of code. This approach slows down the script a lot.
Also, I've looked at How to share pandas DataFrame object between processes? but didn't find a better way.
You can try defining process as:
def process(sessions, user):
    ...
And put it wherever you prefer.
Then, when you call p.map, you can use the functools.partial function, which allows you to specify arguments incrementally:
from functools import partial
...
p.map(partial(process, sessions), sessions_id)
This should not slow the processing down too much and should address your issue.
Note that in principle you could write the same thing without partial, using a lambda:
p.map(lambda user_id: process(sessions, user_id), sessions_id)
but the standard multiprocessing.Pool pickles the callable it sends to the workers, and lambdas cannot be pickled, so a partial of a module-level function is the safer choice.
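A minimal end-to-end sketch of the partial-based layout (file and column names follow the question; the per-user calculation is a placeholder). Note that partial still ships the sessions DataFrame to the worker processes via pickling, so for a very large frame the original global-variable approach on a fork-based platform may still be faster:
from functools import partial
import multiprocessing

import pandas as pd

def process(sessions, user):
    """Collects one user's rows and computes some per-user statistics."""
    user_session = sessions.loc[sessions['user_id'] == user]
    return pd.Series({'n_actions': len(user_session)})  # placeholder calculation

if __name__ == '__main__':
    sessions = pd.read_csv('sessions.csv')
    sessions_id = sessions['user_id'].unique()
    with multiprocessing.Pool(4) as p:
        result = p.map(partial(process, sessions), sessions_id)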
