Please be patient, I am new to Python and Pandas.
I have a lot of pandas dataframes, but some are duplicates. So I wrote a function that checks whether two dataframes are equal; if they are, one of them is deleted:
def check_eq(df1, df2):
    if df1.equals(df2):
        del df2
        print("Deleted %s" % df_name)
The function works, but I would like to know how to get the variable "df_name" as a string holding the name of the dataframe.
I don't understand: the parameters df1 and df2 are dataframe objects, so how can I get their names at run time if I want to print them?
Thanks in advance.
What you are trying to use is an f-string.
def check_eq(df1, df2):
    if df1.equals(df2):
        del[df2]
        print(f"Deleted {df2.name}")
However, this print call will fail: df2 is deleted right before its name attribute is read, so df2 is unbound at that point and Python raises an UnboundLocalError.
Instead try this:
def check_eq(df1, df2):
    if df1.equals(df2):
        print(f"Deleted {df2.name}")
        del df2
Note that your use of 'del' is also not quite right. I assume you want to delete the second dataframe in your code. However, del only removes the name df2 inside the scope of the check_eq function; the dataframe is still reachable through the variable in the calling code. You should familiarize yourself with the concept of scope first: https://www.w3schools.com/python/python_scope.asp
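The code I used:
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df1 = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d)
# .name is a plain attribute set by hand here; pandas does not fill it in
df1.name = 'dataframe1'
df2.name = 'dataframe2'

def check_eq(df1, df2):
    if df1.equals(df2):
        print(f"Deleted {df2.name}")

check_eq(df1, df2)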
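A dataframe object does not know which variable names refer to it, and del inside a function cannot remove a name from the caller's scope. One possible workaround, sketched below, is to keep the dataframes in a dict keyed by name and drop the duplicates from that dict. The helper drop_duplicate_frames, the dict layout and the sample data are illustrative assumptions, not part of pandas or of the original question.

```python
import pandas as pd

def drop_duplicate_frames(frames):
    """Return a new dict without dataframes that equal an earlier one.

    `frames` maps a name (str) to a DataFrame. Hypothetical helper,
    written only for this illustration.
    """
    unique = {}
    for name, df in frames.items():
        # Look for an already-kept dataframe with identical contents.
        duplicate_of = next(
            (kept for kept, kept_df in unique.items() if kept_df.equals(df)),
            None,
        )
        if duplicate_of is None:
            unique[name] = df
        else:
            print(f"Deleted {name} (duplicate of {duplicate_of})")
    return unique

d = {'col1': [1, 2], 'col2': [3, 4]}
frames = {
    'dataframe1': pd.DataFrame(data=d),
    'dataframe2': pd.DataFrame(data=d),
    'dataframe3': pd.DataFrame({'col1': [9, 9], 'col2': [3, 4]}),
}
frames = drop_duplicate_frames(frames)
print(list(frames))
```

Because the name is the dict key, it is always available as a string, and removing an entry from the dict really does drop that reference to the dataframe.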
Related
I am trying to write a function that performs operations on the columns of a data frame, but it ends with an error: in lambda x: x.variable, the name variable is taken literally. How do I make x use the value stored in variable instead?
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

def example(dataFrame, variable):
    dataFrame = dataFrame.assign(newColumn=lambda x: x.variable**2)

example(df, 'col1')
AttributeError: 'DataFrame' object has no attribute 'variable'
How can I fix this?
Change x.variable to x[variable] inside your lambda:
def example(dataFrame, variable):
    dataFrame = dataFrame.assign(newColumn=lambda x: x[variable]**2)
    return dataFrame  # you probably want to return the dataFrame as well

example(df, 'col1')
Attribute access, df.column_name (here x.variable), looks for a column literally named "variable"; it never uses the value stored in the Python variable of that name. No such column exists, hence the AttributeError. Bracket access, df[column_name] (here x[variable]), evaluates the variable first and looks up the column it names, 'col1'.
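The difference between the two lookups can be checked directly. This is a small sketch using the dataframe from the question; the names by_value, by_attr and has_literal are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
variable = 'col1'

# Bracket access uses the value stored in `variable`, i.e. 'col1'.
by_value = df[variable]

# getattr(df, 'col1') performs the same lookup that `df.col1` would.
by_attr = getattr(df, variable)

# `df.variable` would look for a column literally named "variable",
# which this dataframe does not have.
has_literal = hasattr(df, 'variable')

print(by_value.equals(by_attr), has_literal)
```

Bracket access is usually the safer choice inside functions, since it also works for column names that are not valid Python identifiers.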
Let's assume we have the following Pandas dataframe df:
df = pd.DataFrame({'food': ['spam', 'ham', 'eggs', 'ham', 'ham', 'eggs', 'milk'],
                   'sales': [10, 15, 12, 5, 14, 3, 8]})
Let's further assume that we have the following function that squares the value of the sales column in df:
def square_sales(df):
    df['sales'] = df['sales']**2
    return df
Now, let's assume we have a requirement to: "return df to the caller"
Does this mean that we pass a df to the square_sales function, then return the processed df (i.e. the df with the squared sales column)?
Or, does this mean that we pass df to square_sales, then assign that function call to a variable named df? For example:
df = square_sales(df)
Thanks!
The function changes df itself (an in-place operation). Even if you don't return df, it will have changed in the calling scope as well.
The way it is written will work the same for both cases:
df = square_sales(df)
and
square_sales(df)
If you need to return a new df without altering the original, you'll have to make a copy first and only then assign the new column. In this case you will also have to assign the returned df to a new variable:
def square_sales(df):
    df2 = df.copy(deep=True)
    df2['sales'] = df2['sales']**2
    return df2
new_df = square_sales(df)
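The two behaviours can be compared side by side. In this sketch the function names square_sales_inplace and square_sales_copy, and the shortened sample data, are assumptions introduced only to keep both variants in one script.

```python
import pandas as pd

def square_sales_inplace(df):
    # Column assignment mutates the dataframe the caller passed in.
    df['sales'] = df['sales'] ** 2
    return df

def square_sales_copy(df):
    # Work on a copy so the caller's dataframe stays untouched.
    df2 = df.copy(deep=True)
    df2['sales'] = df2['sales'] ** 2
    return df2

original = pd.DataFrame({'food': ['spam', 'ham'], 'sales': [10, 15]})

copied = square_sales_copy(original)
sales_after_copy = original['sales'].tolist()     # original is unchanged

returned = square_sales_inplace(original)
sales_after_inplace = original['sales'].tolist()  # original is now squared
same_object = returned is original                # no new dataframe was made

print(sales_after_copy, sales_after_inplace, same_object)
```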
I think there's some aspect of functions and variable scope that you're confused about, but I'm not sure precisely what. If the function returns a DataFrame, then outside of the function you can assign that returned DataFrame to whatever variable you want. Whether or not the variable name outside the function is the same as the variable name inside the function doesn't matter, as far as the function is concerned.
SiP's answer already points out that your function modifies the original input DataFrame in place and returns the updated version. I would caution that this is a misleading antipattern. Functions that operate on a mutable value (like a DataFrame) are usually expected to do only one or the other. And Pandas' own methods, by default, return the new value without modifying in place, as it appears you've been asked to do.
So I would advise that you use the modified function suggested by SiP, that copies the supplied DataFrame before making changes. As for using it, all of these do basically the same thing:
# Define df
df = square_sales(df)
# Define df
new_df = square_sales(df)
# Define df
some_other_variable_name = square_sales(df)
The only real difference is that in the first case, you no longer have access to the previous, unmodified DataFrame. But if you don't need that, and you only plan to use the squared version from then on, it can make perfect sense.
(Also, if you wanted to, you could alter the function definition to use a different parameter name, say my_internal_df. This would not in any way affect how any of those three examples work.)
I'm wondering if there's any benefit to writing this pattern
def feature_eng(df):
    df1 = df.copy()
    ...
    return df1
as opposed to this pattern
def feature_eng(df):
    ...
    return df
Say you have a raw dataframe df_raw and you create df_feature using feature_eng. If the function body modifies df in place, the second pattern will also change df_raw when you call df_feature = feature_eng(df_raw), while the first pattern will not. So if you want to keep df_raw as it is, the first pattern gives the correct result.
A little example:
def feature_eng1(df):
    df.drop(columns=['INDEX'], inplace=True)
    return df

def feature_eng2(df):
    df1 = df.copy()
    df1.drop(columns=['INDEX'], inplace=True)
    return df1

df_feature = feature_eng1(df_raw)
Here df_raw will no longer contain the column INDEX, while with feature_eng2 it still would.
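A third option, sketched here, avoids both the explicit copy and the in-place change: drop() called without inplace=True returns a new dataframe and leaves its input alone. The name feature_eng3 and the contents of df_raw are hypothetical, chosen only so the example runs.

```python
import pandas as pd

def feature_eng3(df):
    # Without inplace=True, drop() returns a new DataFrame
    # and does not modify the one passed in.
    return df.drop(columns=['INDEX'])

df_raw = pd.DataFrame({'INDEX': [0, 1], 'value': [10, 20]})
df_feature = feature_eng3(df_raw)

print(list(df_raw.columns), list(df_feature.columns))
```

This only holds as long as every step in the function returns a new object; a single in-place assignment on df would still leak into df_raw.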
I have quite a difficult issue to explain, I'll try my best.
I have a function a() that calls function b() and passes to b() a dataframe (called "df_a").
I learned that this is done by reference, meaning that if I add a new column to the input dataframe inside function b(), this also modifies the original one. For example:
def b(df_b):
    df_b['Country'] = "not sure"

def a():
    df_a = pd.DataFrame({"Name":['Mark','Annie'], 'Age':[30,28]})
    b(df_a)
    print(df_a) # this dataframe will now have the column "Country"
So far so good. The problem is that today I realized that if we merge the dataframe with another dataframe inside b(), this creates a new local dataframe.
def b(df_b):
    df_c = pd.DataFrame({"Name":['Mark','Annie'], 'Country':['Brazil','Japan']})
    df_b = pd.merge(df_b, df_c, left_on = 'Name', right_on='Name', how='left')

def a():
    df_a = pd.DataFrame({"Name":['Mark','Annie'], 'Age':[30,28]})
    b(df_a)
    print(df_a) # this dataframe will *not* have the column "Country"
So my question is: how do I make sure that in this second example the column "Country" is also assigned to the original df_a dataframe, without returning it?
(I would prefer not to use "return df_b" inside function b(), since I would have to change the logic in many, many parts of the code.)
Thank you
I have modified the functions b() and a() so that the changes made in b() are returned to a():
def b(df_b):
    df_c = pd.DataFrame({"Name":['Mark','Annie'], 'Country':['Brazil','Japan']})
    df_b = pd.merge(df_b, df_c, left_on = 'Name', right_on='Name', how='left')
    return df_b

def a():
    df_a = pd.DataFrame({"Name":['Mark','Annie'], 'Age':[30,28]})
    df_a = b(df_a)
    print(df_a)
**Output:** a()
    Name  Age Country
0   Mark   30  Brazil
1  Annie   28   Japan
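If returning the dataframe really is not an option, one possible alternative is to avoid rebinding df_b and assign the looked-up column to it instead. Column assignment mutates the object the caller passed in, whereas df_b = pd.merge(...) only points the local name at a new dataframe. The sketch below replaces the merge with a Series.map lookup and assumes each Name appears only once in df_c; a() returns df_a here only so the result can be inspected.

```python
import pandas as pd

def b(df_b):
    df_c = pd.DataFrame({"Name": ['Mark', 'Annie'],
                         'Country': ['Brazil', 'Japan']})
    # Build a Name -> Country lookup (assumes unique names in df_c).
    country_by_name = df_c.set_index('Name')['Country']
    # Assigning a column changes the caller's dataframe in place;
    # names missing from df_c would get NaN, like a left merge.
    df_b['Country'] = df_b['Name'].map(country_by_name)

def a():
    df_a = pd.DataFrame({"Name": ['Mark', 'Annie'], 'Age': [30, 28]})
    b(df_a)  # no return value needed
    return df_a

result = a()
print(result)
```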
I have 2 dataframes. df1 is built from several Series of values.
df1 = pd.DataFrame({'winnings': cumsums_winnings_s, 'returns':cumsums_returns_s, 'spent': cumsums_spent_s, 'runs': cumsums_runs_s, 'wins': cumsums_wins_s, 'expected': cumsums_expected_s}, columns=["winnings", "returns", "runs", "wins", "expected"])
df2 is built by running each row of df1 through a function that takes 3 columns and produces one result per row, specialSauce:
df2= pd.DataFrame(list(map(lambda w,r,e: doStuff(w,r,e), df1['wins'], df1['runs'], df1['expected'])), columns=["specialSauce"])
print(df2.append(df1))
This produces all the columns, but with NaN wherever a row's source dataframe lacks the column (and the other way round if df1 and df2 are switched in the append).
So the problem I have is how to append these 2 dataframes correctly.
As I understand things, your issue seems to be related to the fact that you get NaN's in the result DataFrame.
The reason for this is that you are trying to .append() one dataframe to the other while they don't have the same columns.
df2 has one column, the one created with map() and doStuff, and df1 does not have that column. When you append one pd.DataFrame to another, the result has all the columns of both pd.DataFrame objects. Naturally, you will get NaN's in ['specialSauce'] for the rows that come from df1, since this column does not exist in df1, and NaN's in the df1 columns for the rows that come from df2.
This would be the same if you were to use pd.concat() with its default axis; both methods do the same thing in this case. One thing you could do to bring the result closer to what you want is to use the ignore_index flag, like this:
>> df2.append(df1, ignore_index=True)
This would at least give you a 'fresh' index for the result pd.DataFrame.
EDIT
If what you're looking for is to "append" the result of doStuff to the end of your existing df, in the form of a new column (['specialSauce']), then what you'll have to do is use pd.concat() like this:
>> pd.concat([df1, df2], axis=1)
This will return the result pd.DataFrame as you want it.
If you had a pd.Series to add to the columns of df1 then you'd need to add it like this:
>> df1['specialSauce'] = <'specialSauce values'>
I hope that helps, if not please rephrase the description of what you're after.
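The difference between stacking rows and placing columns side by side can be seen on a small example. This is a sketch with made-up values standing in for the asker's data. It uses pd.concat() for both cases, since DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0.

```python
import pandas as pd

# Made-up stand-ins for the question's df1 and df2.
df1 = pd.DataFrame({'wins': [1, 2], 'runs': [3, 4]})
df2 = pd.DataFrame({'specialSauce': [4, 6]})

# Default axis=0 stacks rows: 4 rows, NaN wherever a column is missing.
stacked = pd.concat([df1, df2], ignore_index=True)

# axis=1 aligns on the index and puts the columns next to each other.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side)
```

Note that axis=1 aligns rows by index label, so both dataframes need matching indexes for the rows to line up.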
Ok, there are a couple of things going on here. You've left code out and I had to fill in the gaps. For example you did not define doStuff, so I had to.
doStuff = lambda w, r, e: w + r + e
With that defined, your code does not run. I had to guess what you were trying to do. I'm guessing that you want to have an additional column called 'specialSauce' adjacent to your other columns.
So, this is how I set it up and solved the problem.
Setup and Solution
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame(np.random.randn(100, 6),
                  columns=["winnings", "returns",
                           "spent", "runs",
                           "wins", "expected"]).cumsum()
doStuff = lambda w, r, e: w + r + e
df['specialSauce'] = df[['wins', 'runs', 'expected']].apply(lambda x: doStuff(*x), axis=1)
print(df.head())
   winnings   returns     spent      runs      wins  expected  specialSauce
0  0.166085  0.781964  0.852285 -0.707071 -0.931657  0.886661     -0.752067
1 -0.055704  1.163688  0.079710  0.155916 -1.212917 -0.045265     -1.102266
2 -0.554241  1.928014  0.271214 -0.462848  0.452802  1.692924      1.682878
3  0.627985  3.047389 -1.594841 -1.099262 -0.308115  4.356977      2.949601
4  0.796156  3.228755 -0.273482 -0.661442 -0.111355  2.827409      2.054611
Also
You tried to use pd.DataFrame.append(). Per the linked documentation, it attaches the rows of the DataFrame passed as the argument to the end of the DataFrame it is called on. You would have wanted to use pd.concat() with axis=1 instead.