Creating an empty DataFrame as a default parameter - python

I am trying to write a Python function that plots data from a DataFrame. The parameters should be either just the data, or the data and the standard deviation.
As a default parameter for the standard deviation, I want to use an empty DataFrame.
def plot_average(avg_df, stdev=pd.DataFrame()):
    if not stdev.empty:
        ...
But implementing it like that gives me the following error message:
TypeError: 'module' object is not callable
How can an empty DataFrame be created as a default parameter?

For a default empty DataFrame:
def f1(my_df=None):
    if my_df is None:
        my_df = pd.DataFrame()
    # stuff to do if it's not empty
    if len(my_df) != 0:
        print(my_df)
    else:
        print("Nothing")

A DataFrame is mutable, so a better approach is to default to None and then assign the default value in the function body. See https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments
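The gotcha is easiest to see with a plain list; this is a hedged sketch of the general Python behaviour, not pandas-specific. The default object is created once, when the function is defined, and is then shared across every call:

```python
def append_bad(item, acc=[]):
    # `acc` defaults to ONE list, created at definition time
    acc.append(item)
    return acc

def append_good(item, acc=None):
    # a fresh list is created on every call that omits `acc`
    if acc is None:
        acc = []
    acc.append(item)
    return acc

print(append_bad(1))   # [1]
print(append_bad(2))   # [1, 2]  <- the default kept its state!
print(append_good(1))  # [1]
print(append_good(2))  # [2]
```

A `pd.DataFrame()` default does not raise by itself, but the same sharing applies if the function ever mutates it, which is why the None-default pattern above is the safer choice.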

The problem lay not in the creation of a new DataFrame but in the way the function was called. I use PyCharm Scientific, in which I had the function call written in a block. Executing just this block called the function, which, I presume, had not been compiled yet.
Executing the whole program made it possible to call the function.

Related

Retrieve the name of an instance of DataFrame, passed as an argument to the function

I am looking to retrieve the name of an instance of DataFrame, that I pass as an argument to my function, to be able to use this name in the execution of the function.
Example in a script:
display(df_on_step_42)
I would like to retrieve the string "df_on_step_42" to use in the execution of the display function (that display the content of the DataFrame).
As a last resort, I can pass both the DataFrame and its name as arguments:
display(df_on_step_42, "df_on_step_42")
But I would prefer to do without this second argument.
PySpark DataFrames are immutable, so in our data pipeline we cannot systematically attach a name attribute to all the new DataFrames derived from other DataFrames.
You can use the globals() dictionary to search for your variable, matching it by identity with eval.
As @juanpa.arrivillaga mentions, this is fundamentally bad design, but if you need to, here is one way to do it, inspired by this old SO answer for Python 2 -
import pandas as pd

df_on_step_42 = pd.DataFrame()

def get_var_name(var):
    # scan the global names for one bound to exactly this object
    for k in globals().keys():
        try:
            if eval(k) is var:
                return k
        except Exception:
            pass

get_var_name(df_on_step_42)
'df_on_step_42'
Your display would then look like -
display(df_on_step_42, get_var_name(df_on_step_42))
Caution
This will fail for aliases of a variable, since they simply point to the same object as the original. If the original variable occurs first during iteration over the keys, its name is returned instead of the alias's.
a = 123
b = a
get_var_name(b)
'a'
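A variant of the same idea avoids eval by iterating over globals().items() directly; this is a sketch (not part of the original answer), and the alias caveat above still applies unchanged:

```python
def get_var_name(var):
    # return the first global name bound to exactly this object
    for name, value in globals().items():
        if value is var:
            return name

my_unique_list = [1, 2, 3]
print(get_var_name(my_unique_list))  # 'my_unique_list'
```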
I finally found a solution to my problem using the inspect and re libraries.
I use the following lines, which implement the display() function:

import inspect
import re

def display(df):
    # get the caller's frame and the source text of the calling line
    frame = inspect.getouterframes(inspect.currentframe())[1]
    name = re.match(r"\s*display\((\S+)\)", frame.code_context[0])[1]
    print(name)

display(df_on_step_42)
The inspect module gives me the calling context of the function; within it, the code_context attribute holds the text of the line where the function is called, and the re module lets me isolate the name of the DataFrame passed as a parameter.
It's not optimal, but it works.

How can I get a repr of a DataFrame that is a valid Python expression?

I would like to unit test some code that returns a pd.DataFrame. The code already returns the correct value, and I'd like to use pd.testing.assert_frame_equal to assert that it returns this value so that if I change the code later I can catch regressions.
Many types in Python have a .__repr__() method that returns a valid Python expression that can be used to reconstruct the object. When I want to write this sort of test for one of these types, I can run
actual = my_func()
print(repr(actual))
and then the test is
def my_test():
    actual = my_func()
    expected = <what was printed>
    assert actual == expected
However, DataFrame's __repr__() method does not return a Python expression. Is there a way to print out some Python code that reconstructs the DataFrame?
Using DataFrame.to_dict() and then passing that to DataFrame.__init__ is close:
def my_test():
    actual = my_func()
    expected = DataFrame(<output of print(actual.to_dict())>)
    pd.testing.assert_frame_equal(actual, expected)
But this falls apart when actual is empty, because the column types can't be inferred, and this is the case I'm trying to test.
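One hedged workaround for the empty case (my_func and the column names below are hypothetical stand-ins): spell the dtypes out explicitly when reconstructing the expected frame, e.g. from actual.dtypes, instead of relying on inference from to_dict():

```python
import pandas as pd

def my_func():  # hypothetical function under test
    return pd.DataFrame({"id": pd.Series(dtype="int64"),
                         "name": pd.Series(dtype="object")})

actual = my_func()

# Reconstruct an empty frame with explicit dtypes rather than
# letting DataFrame(<dict>) infer them from (absent) values.
expected = pd.DataFrame({col: pd.Series(dtype=dt)
                         for col, dt in actual.dtypes.items()})

pd.testing.assert_frame_equal(actual, expected)  # passes
```

In a real test the dtype mapping would be written out as a literal, so that a regression in the column types of my_func's result is caught.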

How to pass dataframe columns as a input for R function calling from python?

I have a function in R that I want to use from Python, but I am getting an error while passing the arguments.
Calling the R function from Python follows the approach in "calling a R function from python code with passing arguments" from this site.
I have tried the following with my existing function:
path = ""
actual_values = data['col1']
predicted_values = data['col2']

def function1(input1, output):
    r = ro.r
    r.source(path + "script.R")
    p = r.metrics_fun(input1, output)  # the function name in R is metrics_fun
    return p

output_df = function1(actual_values, predicted_values)
I am getting error like below
NotImplementedError: Conversion 'py2rpy' not defined for objects of type '<class 'pandas.core.series.Series'>'
Please help me find a solution for this.

how to pass the parameters to apply correctly

There is an api function
get_next_trading_date(exchange='SZSE', date='2017-05-01')
and I have a DataFrame backTestRecordAfterModified, shown as follows.
when I run
backTestRecordAfterModified['createdAt']=backTestRecordAfterModified['createdAt'].apply(func=get_next_trading_date, exchange='SZSE')
the console displays the message: TypeError: get_next_trading_date() got multiple values for argument 'exchange'
So, how do I pass the parameters correctly?
Supplementary:
backTestRecordAfterModified['createdAt'] = backTestRecordAfterModified['createdAt'].apply(lambda date: get_next_trading_date(date, exchange='SZSE'))
The code above still raises the same error.
I have added the definition of get_next_trading_date.
I just found the final answer:
backTestRecordAfterModified['createdAt']=backTestRecordAfterModified['createdAt'].apply(lambda date: get_next_trading_date(date=date,exchange='SZSE'))
You'll have to use a lambda-function to pass the additional parameter to the get_next_trading_date() function:
backTestRecordAfterModified['createdAt']=backTestRecordAfterModified['createdAt'].apply(lambda date: get_next_trading_date(date=date, exchange='SZSE'))
The pandas.Series.apply() function does in fact support additional keyword arguments for the function, but the first argument to the function is always the value from the pandas series.
If get_next_trading_date() were defined differently, with the order of the arguments reversed:
get_next_trading_date_2(date='2017-05-01', exchange='SZSE')
you could have used
backTestRecordAfterModified['createdAt'] = backTestRecordAfterModified['createdAt'].apply(func=get_next_trading_date_2, exchange='SZSE')
The apply function is called for each value of the pandas series, and the value is passed as an argument to the function by default.
The additional arguments you specify are passed after the series value.
So in your example, each time the function call would be like
get_next_trading_date(<i-th value of the series>, exchange='SZSE')
But in your function, the first parameter is exchange, so the <i-th value of the series> (the current date) is passed to exchange, and then the keyword argument tries to set the same parameter again. This causes the error. More in here.
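A minimal, self-contained reproduction, with a dummy stand-in for the API function (its real definition isn't shown in the question):

```python
import pandas as pd

def get_next_trading_date(exchange, date):
    # dummy body for illustration only; `exchange` comes first,
    # as in the question
    return date + "+1"

s = pd.Series(["2017-05-01", "2017-05-02"])

# s.apply(get_next_trading_date, exchange='SZSE')  # TypeError:
# each series value goes positionally to the first parameter,
# `exchange`, which the keyword argument then tries to set again.

# The lambda routes the series value to `date` instead:
out = s.apply(lambda d: get_next_trading_date(date=d, exchange='SZSE'))
print(list(out))  # ['2017-05-01+1', '2017-05-02+1']
```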
Here you have two options.
a) Change the function definition to take date as first argument, so that you don't have to change your function call. But be sure to make the change everywhere you call this function.
get_next_trading_date(date='2017-05-01', exchange='SZSE')
b) Change your function call to pass date as second argument.
backTestRecordAfterModified['createdAt'] = backTestRecordAfterModified['createdAt'].apply(lambda date: get_next_trading_date(date, exchange='SZSE'))
(Note that Series.apply() has no inplace parameter, so the result must be assigned back to the column, as above.)
One option is to use df.apply instead of series.apply:
df['createdAt'] = df.apply(lambda row: get_date(row['createdAt'], 'SZSE'), axis=1)
Or, if you don't want to pass the whole dataframe:
df['createdAt'] = [get_date(x, 'SZSE') for x in df['createdAt'].values]

passing the values to a function

I have code that substitutes the values correctly, but for some reason the run_instances function receives the entire string as a single argument (instead of four separate values).
import boto
ec2_conn = boto.connect_ec2(aws_access_key_id='XXX', aws_secret_access_key='XXX')
ami='ami-XXX'
key_name='XXX15a.pem'
instance_type='t1.macro'
aid="image_id='%s', placement='us-east-1a', key_name='%s', instance_type='%s'" % (ami, key_name, instance_type)
When I try to execute the run_instances function...
ec2_conn.run_instances(aid)
<Message>Invalid id: "image_id='ami-XXX', placement='us-east-1a', key_name='XXX.pem', instance_type='t1.macro'" (expecting "ami-...")</Message>
Is there any way to pass the values to the function correctly?
Simplifying the problem statement: how do you store multiple values so that they can later be passed into a function, without passing each one individually?
params = dict(image_id='ami-XXX', placement='us-east-1a', key_name='XXX15a.pem', instance_type='t1.macro')
ec2_conn.run_instances(**params)
Store them in a dict whose keys match the parameter names and expand them into keyword arguments with **.
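The mechanism can be sketched with a dummy function standing in for ec2_conn.run_instances (the real boto call is not invoked here):

```python
def run_instances(image_id, placement=None, key_name=None, instance_type=None):
    # stand-in for ec2_conn.run_instances, for illustration only
    return (image_id, placement, key_name, instance_type)

params = dict(image_id='ami-XXX', placement='us-east-1a',
              key_name='XXX15a.pem', instance_type='t1.macro')

# ** unpacks the dict into four separate keyword arguments
print(run_instances(**params))
# ('ami-XXX', 'us-east-1a', 'XXX15a.pem', 't1.macro')
```

Building one big string, as in the question, produces a single positional argument; the ** expansion is what makes the call equivalent to writing the four keyword arguments out by hand.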
