Running the same code but with two different datasets (inputs) - python

I have code in JupyterLab that consists of several functions spread across several cells. The first function generates a dataset that is used in all the other functions after it.
What I have to do is run the same code twice but with one of the functions modified. So it would look like this:
data_generating_function() # this function should only be run once, so it generates the same dataset for both trials
function_1() # this is the function that is modified once, so there are two versions of this function
function_2() # this function and all functions below it stay the same but should be run twice
function_3()
function_4()
function_5()
So I would run data_generating_function() once to generate the dataset. Then I would run one version of function_1() and all the functions below it, then the other version of function_1() and all the functions below it.
What would be a good way to implement this? I could obviously duplicate the code and just change some function names, or I could put it all into a single cell and create a for loop. However, is there a better way, ideally one that preserves the multiple cells too?
Thank you

Simply iterate over your two choices for the first function:
data_generating_function()
for func1 in (function1a, function1b):
    func1()
    function_2()
    function_3()
    function_4()
    function_5()

Apologies if I misunderstood the question, but could you not do the following:
Cell 1:
# define all functions
Cell 2:
dataset = data_generating_function()
Cell 3:
# Run version 1 of function 1 on dataset
result_1_1 = function_1_v1(dataset)
result_2_1 = function_2(result_1_1)
result_3_1 = function_3(result_2_1)
function_4(result_3_1)
Cell 4:
# Run version 2 of function 1 on dataset
result_1_2 = function_1_v2(dataset)
result_2_2 = function_2(result_1_2)
result_3_2 = function_3(result_2_2)
function_4(result_3_2)
This solution assumes that:
- the functions return values
- passing the results around is not "expensive"
You can also persist the results to a file if the latter is not the case.
To reduce code duplication in function_1, you can add a parameter that switches between the two versions.
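For instance, a minimal sketch of that version-switch parameter (the function names and the toy dataset here are hypothetical):

```python
def data_generating_function():
    # hypothetical stand-in for the real dataset generator
    return list(range(10))

def function_1(dataset, version=1):
    # the two variants of function_1 differ only in this branch
    if version == 1:
        return [x * 2 for x in dataset]
    return [x * 3 for x in dataset]

dataset = data_generating_function()  # generated once, shared by both trials
results = {v: function_1(dataset, version=v) for v in (1, 2)}
```

Each trial then consumes `results[1]` or `results[2]` in the downstream functions.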

You should try to avoid modifying or directly iterating over functions whenever possible. The best thing to do in this case would be to add a boolean parameter to function1 specifying which version of the function you want to run. It would look something like this:
def function1(isFirstTime):
    if isFirstTime:
        # do stuff the first time
        pass
    else:
        # do stuff the second time
        pass
You could then iterate over the functions:
data_generating_function()
for b in (True, False):
    function1(b)
    function2()
    function3()
    # ...

Related

Python recursive function stuck in infinite loop despite Try statement

I want to write a simple function that prints out a row of a pandas dataframe. If the dataframe has already been loaded into memory, I don't want to reload it. If it has not been loaded into memory when the function is called for the first time, I want to load it.
I am using a Try statement to test whether the dataframe exists (Testing if a pandas DataFrame exists). The alternative options for doing this produce NameError as the DF has not been initialised to None or empty.
global dfIndices

def getSeriesInfo(seriesCode, verbose=False):
    try:
        if verbose:
            print(dfIndices.loc[seriesCode])
        else:
            print(dfIndices.loc[seriesCode]['Data Item Description'])
    # catch when it hasn't even been defined
    except NameError:
        print('I am here.')
        dfIndices = pd.read_excel(dirAppData + 'Indicies.xlsx')
        dfIndices.set_index('Series ID', inplace=True)
        getSeriesInfo(seriesCode, verbose)

getSeriesInfo('A2304402X')
Assuming I have not already loaded the dfIndices dataframe, I would expect the try statement to fail on the first call, the dataframe to be loaded, the function to be called again, the try to pass, and the function to stop executing.
Instead, I get an infinite loop.
So that I can learn something, why doesn't this work as expected and what should I do differently?
Thanks
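For reference, a minimal sketch of a lazy-load pattern along these lines — note the global declaration has to live inside the function, because assigning to a name in a function body otherwise makes that name local for the whole call (a plain dict stands in for the Excel file here):

```python
def get_data():
    global df_data  # must be declared in the function that assigns the name
    try:
        return df_data
    except NameError:
        # first call: load once; later calls return the already-loaded object
        df_data = {'A2304402X': 'some description'}  # stand-in for pd.read_excel(...)
        return df_data

first = get_data()
second = get_data()  # no reload: both names refer to the same object
```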

Count iterations of pandas dataframe apply function

I'm looking for a way to count how many times my database function runs.
My code looks like this
df = pd.read_csv("nodos_results3.csv")
df['URL_dashboard'] = df.apply(create_url, axis = 1)
df.to_csv('nodos_results4.csv', index = False)
I want to count how many times the function "create_url" runs. If I was in C++, for example, I would simply have the function take in another input
void create_url(database i_DB, int i_count)
{
    // stuff I want done to the database
    i_count++;
}
but I'm not sure how to do something equivalent using pandas dataframe apply.
Update: For anyone who might find this while googling in the future - I personally didn't solve this issue. I was moved to another project and didn't continue working on this. Sorry I can't be more help.
apply executes the function exactly once for each row, so the function is executed df.shape[0] times. Correction: as per @juanpa.arrivillaga, apply executes the function twice on the first row, so the correct answer is df.shape[0] + 1.
Alternatively, create a global variable (say, create_url_counter=0) and increment it in the function body:
def create_url(...):
    global create_url_counter
    ...
    create_url_counter += 1
Be aware, though, that having global variables is in general a bad idea.
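For completeness, a runnable sketch of that counter, with a small in-memory frame and a made-up URL builder standing in for the question's CSV workflow:

```python
import pandas as pd

create_url_counter = 0

def create_url(row):
    # hypothetical stand-in for the question's URL builder
    global create_url_counter
    create_url_counter += 1
    return 'https://example.com/{}'.format(row['id'])

df = pd.DataFrame({'id': [1, 2, 3]})
df['URL_dashboard'] = df.apply(create_url, axis=1)
# create_url_counter is now 3 on recent pandas; older versions may add one extra call
```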

Unexpected Behavior with Pandas Datetime/ dt.date

I found a bug in my code: the date column of my dataframe ended up containing only the date, without the hours and minutes. I traced the issue to running these two functions consecutively rather than one at a time. If I run them one by one, there is no problem. If I run them both, my results are unexpected.
I need to run these functions consecutively, but they are not dependent on one another. I'm new to Python, so I thought this might be due to the inputs being overwritten or something (not that that would have happened in Java, as far as I know). So, I changed the functions to be as follows:
def func1(dataset):
    originalData = dataset
    # only look at one day at a time - remove extra unnecessary info
    originalData['Date'] = pd.to_datetime(originalData['Date'])
    print(dataset, 'test1')
    originalData['Date'] = originalData['Date'].dt.date
    print(dataset, 'test2')
    # other stuff

def func2(dataset):
    originalData2 = dataset
    # look at entire datetime
    originalData2['Date'] = pd.to_datetime(originalData2['Date'])
    print(originalData2)
    # other stuff
Run like this, I lose the time in the second function.
csv = pd.read_csv(csvFileName)
func1(csv)
func2(csv)
Run like this, func2 results in my desired output:
csv = pd.read_csv(csvFileName)
func2(csv)
The weird thing is that when I run func1, test1 prints the date with the time, while test2 prints only the date. The dataset is being changed even though the changes are applied to originalData. Am I misunderstanding something? Thanks in advance.
If you don't want to make changes to the underlying data, I'd recommend copying it inside the function like this: originalData = dataset.copy(). By default this produces a deep copy, meaning you'll only be editing the data within the function and not modifying the underlying object.
Odd behavior, yes.
You may also run into this when taking slices of dataframes and doing transformations on them.
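A quick way to see the difference, using a tiny hypothetical one-row frame:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2021-01-01 10:30:00']})  # hypothetical input

def strip_time(dataset):
    data = dataset.copy()  # edit an independent copy, not the caller's frame
    data['Date'] = pd.to_datetime(data['Date']).dt.date
    return data

stripped = strip_time(df)
# df still holds the full timestamp; only the copy lost the time component
```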

Python lazy evaluation?

Suppose I have the following code:
def my_func(input_line):
    is_skip_line = self.is_skip_line(input_line)  # parse input line, check if skip line
    if is_skip_line:
        # do something...
    # do more ...
    if is_skip_line:
        # do one last thing
So we have a check for is_skip_line (if is_skip_line:) that appears twice. Does it mean that due to lazy evaluation the method self.is_skip_line(input_line) will be called twice?
If so, what is the best workaround, given that self.is_skip_line(input_line) is time consuming? Do I have to "immediately invoke" it, like below?
is_skip_line = (lambda x: self.is_skip_line(x))(input_line)
Thanks.
The misconception here is the idea that this statement is not immediately invoked:
is_skip_line = self.is_skip_line(input_line)
...when in fact, it is. Python evaluates the right-hand side eagerly, so the method self.is_skip_line will only ever be invoked once. Since you assign its result to a variable, you can use that variable as many times as you like in any context you like.
If you're concerned about the performance of it, then you could use cProfile to really test the performance of the method it's called in with respect to the method it's calling.
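A quick demonstration that the assignment is eager — the function body runs once, at the assignment, no matter how many times the variable is read afterwards (the skip-check here is a hypothetical stand-in):

```python
calls = 0

def is_skip_line(line):
    # count how many times the (supposedly expensive) check actually runs
    global calls
    calls += 1
    return line.startswith('#')

flag = is_skip_line('# a comment')  # invoked right here, exactly once

if flag:
    pass  # first use of the cached result
if flag:
    pass  # second use: no further call is made
```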

numpy.delete() requires me to return a new matrix (name) and that can't work with my code

I have a matrix which is built at the start of the class and then used within a couple of functions; in the last function I want to delete a row from the matrix without binding the result to a new variable name, since otherwise I would have to change every single function call (it loops until a winner is found). It works something like this:
matrix defined
... a lot of functions that use it, then winner() is called
def winner():
    if hasWon():
        .....
    else:
        fun1()

def fun1():
    .....
    Iterativefun()

def Iterativefun():
    .....
    matrix = numpy.delete(matrix, obj, axis)
    winner()
Is there a way to delete a row like this? I thought of changing every number in that row (1-10) to 0, which I'm not using, so it would be ignored. Any help will be appreciated.
I'm not sure what version of Numpy you have (or exactly what you're trying to do for that matter), but if you have Numpy 1.7, numpy.squeeze looks like it might be what you're looking for... http://docs.scipy.org/doc/numpy/reference/generated/numpy.squeeze.html#numpy.squeeze
If changing a row to all zeros is acceptable, you can just use:
matrix[row] = 0
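Alternatively, numpy.delete does return a new array, but you can rebind it to the same name — here via global, since the question's functions share a module-level matrix — so every later call sees the smaller array. A minimal sketch with a made-up 4x3 matrix:

```python
import numpy as np

matrix = np.arange(12).reshape(4, 3)  # hypothetical 4x3 matrix

def drop_row(obj):
    # rebind the module-level name so all the other functions see the change
    global matrix
    matrix = np.delete(matrix, obj, axis=0)

drop_row(1)  # removes the second row; matrix is now 3x3
```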
