Unexpected Behavior with Pandas Datetime/ dt.date - python

I found a bug in my code: the date column of my dataframe contained only the date, without the hours and minutes. I traced the issue to running these two functions consecutively rather than one at a time. If I run the functions one by one, there is no problem. If I run them both, my results are unexpected.
I need to run these functions consecutively, but they are not dependent on one another. I'm new to Python, so I thought this might be due to the inputs being overwritten or something (not that that would have happened in Java, as far as I know). So, I changed the functions to be as follows:
import pandas as pd

def func1(dataset):
    originalData = dataset
    # only look at one day at a time - remove extra unnecessary info
    originalData['Date'] = pd.to_datetime(originalData['Date'])
    print(dataset, 'test1')
    originalData['Date'] = originalData['Date'].dt.date
    print(dataset, 'test2')
    # other stuff

def func2(dataset):
    originalData2 = dataset
    # look at entire datetime
    originalData2['Date'] = pd.to_datetime(originalData2['Date'])
    print(originalData2)
    # other stuff
Run like this, I lose the time in the second function.
csv = pd.read_csv(csvFileName)
func1(csv)
func2(csv)
Run like this, func2 results in my desired output:
csv = pd.read_csv(csvFileName)
func2(csv)
The weird thing is that if I run func1, test1 prints out the date with the time, while test2 prints out only the date. The dataset is being changed even though the changes are applied to originalData. Am I misunderstanding something? Thanks in advance.

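In Python, originalData = dataset does not copy anything: it binds a second name to the same DataFrame, so changes made through originalData also show up in dataset (and in csv). If you don't want to change the underlying data, I'd recommend making a copy inside the function like this: originalData = dataset.copy(). By default this is a deep copy, meaning you only edit the data within the function and do not overwrite the caller's object.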
Odd behavior, yes.
You may also run into this when taking slices of dataframes and doing transformations on them.
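The difference between aliasing and copying can be checked with a small sketch. The function names and the two-row sample frame here are made up for illustration; only pd.to_datetime, .dt.date and DataFrame.copy() come from the question and answer above.

```python
import pandas as pd

def strip_time_in_place(dataset):
    # Plain assignment binds a second name to the SAME DataFrame object.
    working = dataset
    working["Date"] = pd.to_datetime(working["Date"]).dt.date
    return working

def strip_time_on_copy(dataset):
    # copy() returns an independent DataFrame (deep copy by default).
    working = dataset.copy()
    working["Date"] = pd.to_datetime(working["Date"]).dt.date
    return working

# Hypothetical sample data standing in for the CSV from the question.
raw = pd.DataFrame({"Date": ["2020-01-01 10:30:00", "2020-01-02 18:45:00"]})

# Working on a copy leaves the caller's frame as it was.
safe_input = raw.copy()
strip_time_on_copy(safe_input)
copy_left_input_untouched = safe_input.equals(raw)

# Working through an alias changes the caller's frame.
aliased_input = raw.copy()
result = strip_time_in_place(aliased_input)
alias_is_same_object = result is aliased_input
alias_changed_input = not aliased_input.equals(raw)

print(copy_left_input_untouched, alias_is_same_object, alias_changed_input)
```

With the aliasing version, any function called afterwards on the same frame sees the date-only column, which matches what the question describes.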

Related

Dask's strange behaviour when adding a new column with random uuids

I'm fairly new to Dask and wondering why it behaves in such a strange way. Essentially, I create a new column with random uuids and join it to another dask dataframe. For some odd reason the uuids keep changing, and I'm not sure if I am missing something.
This is a representation of my code:
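from uuid import uuid4

import dask.dataframe as dd

def generate_uuid(_row=None) -> str:
    """ generates uuid4 id """
    # apply(..., axis=1) passes each row to the function, so it must accept one argument
    return str(uuid4())

my_dask_data = dd.from_pandas(my_pandas_data, npartitions=4)
my_dask_data["uuid"] = None
my_dask_data["uuid"] = my_dask_data.apply(generate_uuid, axis=1, meta=("uuid", "str"))
print(my_dask_data.compute())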
And this is the output:
name uuid
my_name_1 16fb858c-bbed-413b-a415-62099ee2c455
my_name_2 9acd0a22-9b19-4db6-9759-b70dc0353710
my_name_3 5d610aaf-a813-4d0b-8d83-8f11fe400c7e
Then, I do a concat with other dask dataframe:
joined_data = dd.concat([my_dask_data, my_other_dask_data], axis=1)
print(joined_data.compute())
This is the output, which for some reason contains new uuids:
name uuid tests
my_name_1 f951cefa-1145-411c-96f6-924730d7cb22 test1
my_name_2 88e28e5f-42ea-4fbe-a036-b8179a0ba3f8 test2
my_name_3 50e70fac-da19-4d2f-b6ea-80da41591ac5 test3
Any thoughts on how to keep the same uuids without changing?
Dask does not keep your data in memory, by design - this is a huge attractive feature of dask. So every time you compute, your function will be executed again. Since uuid4() is based on a random number generator, different results each time are expected. In fact, UUIDs are never supposed to repeat.
The question is, what would you like to happen, what is your actual workflow? You might be interested in reading this SO question: How to generate a random UUID which is reproducible (with a seed) in Python
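If the goal is simply an identifier that stays the same no matter how often the graph is recomputed, one option is to derive it from the row's own data instead of from a random generator. The sketch below uses the standard library's uuid5, which is a deterministic hash-based UUID. It assumes that the name column is unique per row; the namespace string and the stable_uuid helper are made up for illustration.

```python
import uuid

# Arbitrary, fixed namespace for this dataset (an assumption for the example).
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")

def stable_uuid(name: str) -> str:
    """Return the same UUID every time for the same name."""
    return str(uuid.uuid5(NAMESPACE, name))

names = ["my_name_1", "my_name_2", "my_name_3"]

# Two independent passes, standing in for two separate compute() calls.
first_pass = [stable_uuid(n) for n in names]
second_pass = [stable_uuid(n) for n in names]

print(first_pass == second_pass)
```

A function like this could be applied to the name column in place of generate_uuid. Because the output depends only on the input value, recomputation gives the same ids.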

Running the same code but with two different datasets (inputs)

I have code in JupyterLab that consists of several functions spread across several cells. The first function generates a dataset that is used by all the functions after it.
What I have to do is run the same code twice but with one of the functions modified. So it would look like this:
data_generating_function() # this function should only be run once so it generates the same dataset for both trials
function_1() # this is the function to be modified, so there are two versions of it
function_2() # this function and all functions below it stay the same but should be run twice
function_3()
function_4()
function_5()
So I would run data_generating_function() once and generate the dataset. Then I would run one version of function1() and all the functions below it, then I would run another version of function1() and all the other functions below it.
What would be a good way to implement this? I could obviously duplicate the code and just change some function names, I could also put it all into a single cell and create a for loop. However is there a better way that would ideally preserve multiple cells too?
Thank you
Simply iterate over your two choices for the first function:
data_generating_function()
for func1 in (function1a, function1b):
    func1()
    function_2()
    function_3()
    function_4()
    function_5()
Apologies if I misunderstood the question, but could you not do the following:
Cell 1:
# define all functions
Cell 2:
dataset = data_generating_function()
Cell 3:
# Run version 1 of function 1 on dataset
result_1_1 = function_1_v1(dataset)
result_2_1 = function_2(result_1_1)
result_3_1 = function_3(result_2_1)
function_4(result_3_1)
Cell 4:
# Run version 2 of function 1 on dataset
result_1_2 = function_1_v2(dataset)
result_2_2 = function_2(result_1_2)
result_3_2 = function_3(result_2_2)
function_4(result_3_2)
This solution assumes that:
you define functions with return values
that passing around the results is not "expensive"
You can also persist the results in a file if the latter is not the case.
To reduce code duplication in function_1, you can add a parameter that switches between the two versions.
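As a rough sketch of the same idea in a form that avoids repeating the cell contents, the shared steps can be wrapped in one helper that takes the variant of the first step as an argument. Every function body below is a toy placeholder and run_pipeline is a hypothetical name; only the overall shape (generate once, run two variants) comes from the question.

```python
def data_generating_function():
    # Placeholder dataset; generated once and reused for both runs.
    return [1, 2, 3, 4]

def function_1_v1(data):
    return [x * 2 for x in data]

def function_1_v2(data):
    return [x + 10 for x in data]

def function_2(data):
    return sum(data)

def run_pipeline(dataset, first_step):
    """Run the variant first step, then the shared steps, and return the result."""
    return function_2(first_step(dataset))

dataset = data_generating_function()
results = {
    step.__name__: run_pipeline(dataset, step)
    for step in (function_1_v1, function_1_v2)
}
print(results)
```

The helper can live in its own cell, and each variant can still be run from a separate cell by calling run_pipeline(dataset, function_1_v1) or run_pipeline(dataset, function_1_v2). Since no step modifies its input, both runs see the same dataset.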
You should try to avoid modifying or directly iterating over functions whenever possible. The best thing to do in this case would be to add a boolean parameter to function1 specifying which version of the function you want to run. It would look something like this:
def function1(isFirstTime):
    if isFirstTime:
        # do stuff the first time
        pass
    else:
        # do stuff the second time
        pass
You could then iterate over the functions:
data_generating_function()
for b in (True, False):
    function1(b)
    function2()
    function3()
    # ...

Does data = batch['data'].cuda().function().cpu() make sense?

I have a dataset that I access with batch['data'] to get an MxM image. After I get the image I want to process it with some NumPy operations. I want the dataset to give me the image on the GPU, and then move the output to the CPU.
My question is, is concatenation of functions in Python being executed in an order? And can I do this with
base = batch['data'].cuda().function().cpu()
And is this the same as:
base = batch['data'].cuda().function()
base.cpu()
Thanks in advance!
Well, the CPU(s) will do the same work, but the result is not the same.
base = batch['data'].cuda().cpu()
After that line, you have the output of cpu() stored in the variable called base.
base = batch['data'].cuda()
base.cpu()
After these two lines, you have the output of cuda() stored in the variable called base and you have forgotten the result of cpu().
is concatenation of functions in Python being executed in an order?
Yes, of course: the first method returns some object, and the next one is called on that returned object.
No, these pieces of code are not the same:
The first one assigns the return value of cpu to base
The second one throws this value away
Also, if you need the object returned by batch['data'].cuda(), then the first code will call cpu on it and potentially throw it away afterwards. The second one saves that object but discards the result of calling cpu, which may not be desirable.
The same applies to writing batch['data'].cuda() or tmp = batch['data']; base = tmp.cuda(): batch['data'] returns some object, and then .cuda can be called on that object.
As long as functions return objects that have the methods you want to call, you can chain as many methods as you want: thing().a().b().c().d()
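The difference between the two snippets can be reproduced without a GPU. The sketch below uses a hypothetical FakeTensor class whose methods each return a new object, which is the assumption both answers rely on; it is a stand-in for illustration, not the real tensor API.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor; each method returns a NEW object."""

    def __init__(self, device="cpu"):
        self.device = device

    def cuda(self):
        return FakeTensor("cuda")

    def function(self):
        # Some processing step that keeps the data on its current device.
        return FakeTensor(self.device)

    def cpu(self):
        return FakeTensor("cpu")

batch = {"data": FakeTensor()}

# Chained: calls run left to right, and base holds the result of the last one.
chained = batch["data"].cuda().function().cpu()

# Split: the return value of cpu() is never assigned, so it is lost.
unchained = batch["data"].cuda().function()
unchained.cpu()

print(chained.device, unchained.device)
```

To make the split version equivalent, the result has to be assigned back, for example base = base.cpu().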

Count iterations of pandas dataframe apply function

I'm looking for a way to count how many times my database function runs.
My code looks like this
df = pd.read_csv("nodos_results3.csv")
df['URL_dashboard'] = df.apply(create_url, axis = 1)
df.to_csv('nodos_results4.csv', index = False)
I want to count how many times the function "create_url" runs. If I was in C++, for example, I would simply have the function take in another input
void create_url(database i_DB, int &i_count)
{
    // stuff I want done to database
    i_count++;  // counter is passed by reference so the caller sees the change
}
but I'm not sure how to do something equivalent using pandas dataframe apply.
Update: For anyone who might find this while googling in the future - I personally didn't solve this issue. I was moved to another project and didn't continue working on this. Sorry I can't be more help.
apply executes the function once for each row, so the function is executed df.shape[0] times. Correction: as @juanpa.arrivillaga points out, apply may execute the function twice on the first row (older pandas versions did this), in which case the correct answer is df.shape[0]+1.
Alternatively, create a global variable (say, create_url_counter=0) and increment it in the function body:
def create_url(...):
    global create_url_counter
    ...
    create_url_counter += 1
Be aware, though, that having global variables is in general a bad idea.
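One way to count calls without a module-level global is to wrap the function in a small decorator that stores the counter on the wrapper itself. This is a sketch of that alternative, not something from the answer above: count_calls, the example URL and the sample rows are all made up, and a plain list of dicts stands in for the dataframe rows.

```python
import functools

def count_calls(func):
    """Wrap func so that every call increments wrapper.calls."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def create_url(row):
    # Hypothetical URL scheme, for illustration only.
    return "https://example.com/dashboard/" + str(row["node"])

# Stand-in for the rows that df.apply(create_url, axis=1) would pass in.
rows = [{"node": 1}, {"node": 2}, {"node": 3}]
urls = [create_url(row) for row in rows]

print(create_url.calls)
```

The decorated function can be passed to df.apply(create_url, axis=1) in the same way, and create_url.calls can be read afterwards to see how many times pandas actually called it.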

Python lazy evaluation?

Suppose I have the following code:
def my_func(self, input_line):
    is_skip_line = self.is_skip_line(input_line)  # parse input line, check if skip line
    if is_skip_line:
        pass  # do something...
    # do more ...
    if is_skip_line:
        pass  # do one last thing
So we have a check for is_skip_line (if is_skip_line:) that appears twice. Does it mean that due to lazy evaluation the method self.is_skip_line(input_line) will be called twice?
If so, what is the best work around, given that self.is_skip_line(input_line) is time consuming? Do I have to "immediately invoke" it, like below?
is_skip_line = (lambda x: self.is_skip_line(x))(input_line)
Thanks.
The misconception here is that this statement is not being immediately invoked:
is_skip_line = self.is_skip_line(input_line)
...when in fact, it is.
The method self.is_skip_line will only ever be invoked once. Since you assign it to a variable, you can use that variable as many times as you like in any context you like.
If you're concerned about its performance, you could use cProfile to measure how much time is spent in the method relative to the method that calls it.
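This is easy to confirm by counting the calls. The sketch below wraps the question's code in a hypothetical LineParser class, with a made-up rule that lines starting with "#" are skip lines, and records how often is_skip_line actually runs.

```python
class LineParser:
    """Hypothetical class around the question's my_func, with a call counter."""

    def __init__(self):
        self.calls = 0

    def is_skip_line(self, input_line):
        self.calls += 1  # record every real invocation
        return input_line.startswith("#")  # made-up rule for the example

    def my_func(self, input_line):
        # The call happens here, once; the name then refers to a plain bool.
        is_skip_line = self.is_skip_line(input_line)
        actions = []
        if is_skip_line:
            actions.append("first check")
        if is_skip_line:
            actions.append("second check")
        return actions

parser = LineParser()
actions = parser.my_func("# a comment line")
print(parser.calls, actions)
```

Both if statements test the stored boolean, so the counter stays at one per call to my_func.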
