Please be patient I am new to Python and Pandas.
I have a lot of pandas dataframes, but some are duplicates. So I wrote a function that checks whether two dataframes are equal; if they are, one will be deleted:
def check_eq(df1, df2):
    if df1.equals(df2):
        del df2
        print("Deleted %s" % df_name)
The function works, but I would like to know how to get the variable "df_name" as a string holding the name of the dataframe.
I don't understand: the parameters df1 and df2 are dataframe objects, so how can I get their names at run-time if I want to print them?
Thanks in advance.
What you are trying to use is an f-string.
def check_eq(df1, df2):
    if df1.equals(df2):
        del df2
        print(f"Deleted {df2.name}")
I'm not certain this print call can work, though, since you delete the dataframe right before accessing its name attribute, so df2 is unbound at that point.
Instead try this:
def check_eq(df1, df2):
    if df1.equals(df2):
        print(f"Deleted {df2.name}")
        del df2
Now, do note that your usage of 'del' is also not quite right. I assume you want to delete the second dataframe for good. However, you only delete it inside the scope of the check_eq function. You should familiarize yourself with the scope concept first: https://www.w3schools.com/python/python_scope.asp
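To make the scope point concrete, here is a minimal sketch (check_eq_local and frames are illustrative names, not from the question): deleting the parameter only unbinds the function-local name, so a common workaround is to keep the dataframes in a dict and delete the duplicate by key:

import pandas as pd

def check_eq_local(df1, df2):
    if df1.equals(df2):
        del df2  # only unbinds the *local* name df2

d = {'col1': [1, 2], 'col2': [3, 4]}
a = pd.DataFrame(data=d)
b = pd.DataFrame(data=d)
check_eq_local(a, b)
print(b)  # b still exists in the caller's scope

# Workaround: keep the dataframes in a dict and delete the duplicate by key
frames = {'a': a, 'b': b}
if frames['a'].equals(frames['b']):
    del frames['b']  # actually removes the duplicate from the collection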
The code I used:
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df1 = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d)
df1.name = 'dataframe1'
df2.name = 'dataframe2'

def check_eq(df1, df2):
    if df1.equals(df2):
        print(f"Deleted {df2.name}")
I have quite a difficult issue to explain; I'll try my best.
I have a function a() that calls function b() and passes to b() a dataframe (called "df_a").
I learned that this is done by reference, meaning that if, inside function b(), I add a new column to the input dataframe, the original one is modified too. For example:
def b(df_b):
    df_b['Country'] = "not sure"

def a():
    df_a = pd.DataFrame({"Name": ['Mark', 'Annie'], 'Age': [30, 28]})
    b(df_a)
    print(df_a)  # this dataframe will now have the column "Country"
So far so good. The problem is that today I realized that if inside b() we merge the dataframe with another dataframe, this creates a new local dataframe.
def b(df_b):
    df_c = pd.DataFrame({"Name": ['Mark', 'Annie'], 'Country': ['Brazil', 'Japan']})
    df_b = pd.merge(df_b, df_c, left_on='Name', right_on='Name', how='left')

def a():
    df_a = pd.DataFrame({"Name": ['Mark', 'Annie'], 'Age': [30, 28]})
    b(df_a)
    print(df_a)  # this dataframe will *not* have the column "Country"
So my question is: how do I make sure that in this second example the column "Country" is also assigned to the original df_a dataframe, without returning it back?
(I would prefer not to use "return df_b" inside function b(), since I would have to change the logic in many, many parts of the code.)
Thank you
I have modified the functions b() and a() so the changes made in b() are returned back to a():
def b(df_b):
    df_c = pd.DataFrame({"Name": ['Mark', 'Annie'], 'Country': ['Brazil', 'Japan']})
    df_b = pd.merge(df_b, df_c, left_on='Name', right_on='Name', how='left')
    return df_b

def a():
    df_a = pd.DataFrame({"Name": ['Mark', 'Annie'], 'Age': [30, 28]})
    df_a = b(df_a)
    print(df_a)
**Output** of a():
Name Age Country
0 Mark 30 Brazil
1 Annie 28 Japan
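For what it's worth, if changing all the call sites really is impractical, one possible sketch (assuming the merge is one-to-one, i.e. each Name matches at most one row in df_c) is to merge into a temporary frame and write the new column back into the original object, which mutates the caller's df_a in place:

def b(df_b):
    df_c = pd.DataFrame({"Name": ['Mark', 'Annie'], 'Country': ['Brazil', 'Japan']})
    merged = df_b.merge(df_c, on='Name', how='left')
    # Writing a column into df_b mutates the object the caller passed in;
    # .values sidesteps index alignment between merged and df_b.
    df_b['Country'] = merged['Country'].values

If df_c could match a key several times, the merge would add rows and this copy-back would no longer line up.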
Based on a recent question I came to wonder what exactly goes wrong when sorting a group using inplace=True inside a function applied to groupby.
Example and problem
df = pd.DataFrame({'A': ['a', 'a', 'b'],
                   'B': [3, 2, 1]})

def func(x):
    x.sort_values('B', inplace=True)
    return x.B.max()

dfg = df.groupby('A')
dfg.apply(func)
This gives
A
a 3
b 3
while one would expect
A
a 3
b 1
Printing x inside the function shows that the function func is applied to the group 'a' during each call (the group 'b' is "replaced" entirely):
def func(x):
    print(x)
    x.sort_values('B', inplace=True)
    return x.B.max()
# Output (including the usual extra call pandas apply makes on the first group)
A B
0 a 3
1 a 2
A B
0 a 3
1 a 2
A B
1 a 2
0 a 3
Solution to the problem
This issue can be fixed by performing the sort inside func like x = x.sort_values('B'). In this case, everything works as expected.
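Spelled out, the fixed function is simply:

def func(x):
    x = x.sort_values('B')  # returns a new sorted frame; the group object is untouched
    return x.B.max()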
Question
Now to my conceptual problem. As a first thought I would expect
- that inplace modifies the DataFrame/DataFrameGroupBy itself, while the assignment x = x.sort_values('B') creates a copy
- that this is the groupby equivalent of modifying a list while looping over it
However, inspection of both the DataFrame df and the DataFrameGroupBy instance dfg reveals that they are unchanged after the apply, which suggests that the problem is not the modification of the original instances. So what is going on here?
When I did
def func(x):
    x = x.copy()
    x.sort_values('B', inplace=True)
    return x.B.max()
It returns
A
a 3
b 1
so it confirms your first thought, i.e. that inplace modifies the DataFrame/DataFrameGroupBy itself, while the assignment x = x.sort_values('B') creates a copy.
I iterated over the dfg groupby object as well.
def func(x):
    x.sort_values('B', inplace=True)  # inplace sort returns None, so don't assign the result
    return x.B.max()

dfg = df.groupby('A')
for x in dfg:
    print(func(x[1]))
It returns
3
1
Hence, from my understanding, this issue has to do with how DataFrame.groupby().apply() iterates over its elements.
It seems to hand the same underlying memory block to successive groups, and once you overwrite that block with inplace=True, the change persists into the next call.
Hence your dfg and df variables still hold the original values, but you still get the wrong output.
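One way to probe this claim (just a sketch; the exact object reuse is an implementation detail that varies across pandas versions) is to print the identity of the object each call receives:

def probe(x):
    # if successive calls print the same id, apply is reusing one object
    print(id(x))
    return x.B.max()

df.groupby('A').apply(probe)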
Suppose I have n data frames df_1, df_2, df_3, ..., df_n, containing respectively columns named SPEED1, SPEED2, SPEED3, ..., SPEEDn. For instance:
import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'SPEED1': np.random.uniform(0, 600, 100)})
df_2 = pd.DataFrame({'SPEED2': np.random.uniform(0, 600, 100)})
and I want to make the same changes to all of the data frames. How do I do so by defining a function along similar lines?
def modify(df, nr):
    df_invalid_nr = df_nr[df_nr['SPEED'+str(nr)] > 500]
    df_valid_nr = ~df_invalid_nr
    Invalid_cycles_nr = df[df_invalid]
    df = df[df_valid]
    print(Invalid_cycles_nr)
    print(df)
So, when I try to run the above function
modify(df_1,1)
it returns the entire data frame without modification and the invalid cycles as an empty array. I am guessing I need to apply the modification to the global dataframe somewhere in the function for this to work.
I am also wondering whether I could do this another way, say by just looping an iterator through all the data frames, but I am not sure it will work.
for i in range(1, n+1):
    df_invalid_i = df_i[df_i['SPEED'+str(i)] > 500]
    df_valid_i = ~df_invalid_i
    Invalid_cycles_i = df[df_invalid]
    df = df[df_valid]
    print(Invalid_cycles_i)
    print(df)
How do I, in general, access df_1 using an iterator? It seems to be a problem.
Any help would be appreciated, thanks!
Solution
Inputs
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'SPEED1': np.random.uniform(1, 600, 100)})
df_2 = pd.DataFrame({'SPEED2': np.random.uniform(1, 600, 100)})
Code
To my mind, a better approach would be to store your dfs in a list and enumerate over it, adding a valid column to each df:
for idx, df in enumerate([df_1, df_2]):
    col = 'SPEED' + str(idx+1)
    df['valid'] = df[col] <= 500
print(df_1)
       SPEED1  valid
0  516.395756  False
1   14.643694   True
2  478.085372   True
3  592.831029  False
4    1.431332   True
You can then filter for valid or invalid rows with df_1[df_1.valid] or df_1[df_1.valid == False].
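Spelled out, with ~ as the usual negation idiom:

valid_rows = df_1[df_1.valid]
invalid_rows = df_1[~df_1.valid]  # equivalent to df_1[df_1.valid == False]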
This solution fits your problem as stated; see Another (better?) solution below for a possibly cleaner approach, and the Notes for the explanations you need.
Another (better?) solution
If it is possible for you, re-think your code: each DataFrame has one speed column, so just name it SPEED:
dfs = dict(df_1=pd.DataFrame({'SPEED':np.random.uniform(0,600,100)}),
df_2=pd.DataFrame({'SPEED':np.random.uniform(0,600,100)}))
This allows the following one-liner:
dfs = dict(map(lambda key_val: (key_val[0],
key_val[1].assign(valid = key_val[1]['SPEED'] <= 500)),
dfs.items()))
print(dfs['df_1'])
SPEED valid
0 516.395756 False
1 14.643694 True
2 478.085372 True
3 592.831029 False
4 1.431332 True
Explanations:
dfs.items() returns the key/value pairs (i.e. the names and the DataFrames)
map(foo, bar) applies the function foo (see this answer, and DataFrame.assign) to all the elements of bar (i.e. to all the key/value pairs of dfs.items())
dict() casts the map back to a dict.
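The same transformation written as a dict comprehension, which some may find easier to read than map:

dfs = {name: df.assign(valid=df['SPEED'] <= 500) for name, df in dfs.items()}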
Notes
About modify
Notice that your function modify does not return anything... I suggest reading more about mutability and immutability in Python; this article is interesting.
You can then test the following for instance:
def modify(df):
    df = df[df.SPEED1 < 0.5]
    # The change to df is visible in the scope of the function only;
    # it will not modify your input. Return the df...
    return df

# ... and assign the output to apply the changes
df_1 = modify(df_1)
About accessing df_1 using an iterator
Notice that when you do:
for i in range(1, n+1):
    df_i  # something
df_i in your loop refers, on each iteration, to the single object literally named df_i (and not to df_1, df_2, etc.).
To call an object by its name, use globals()['df_'+str(i)] instead (assuming that df_1 to df_n are located in globals()) - from this answer.
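A minimal illustration of that lookup (at module level, where globals() holds the variables):

df_1 = pd.DataFrame({'SPEED1': [1.0, 2.0]})
looked_up = globals()['df_' + str(1)]
assert looked_up is df_1  # same object, retrieved by its name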
To my mind, this is not a clean approach. I don't know how you create your DataFrames, but if it is possible for you, I suggest storing them in a dictionary instead of assigning them manually:
dfs = {}
dfs['df_1'] = ...
or a bit more automatically, if df_1 to df_n already exist (following the first part of vestland's answer):
dfs = dict((var, eval(var)) for
var in dir() if
isinstance(eval(var), pd.core.frame.DataFrame) and 'df_' in var)
Then it would be easier for you to iterate over your DataFrames:
for i in range(1, n+1):
    dfs['df_'+str(i)]  # something
You can use the globals() function, which allows you to get a variable by its name.
I just add df_i = globals()["df_"+str(i)] at the beginning of the for loop:
for i in range(1, n+1):
    df_i = globals()["df_"+str(i)]
    invalid_mask = df_i['SPEED'+str(i)] > 500  # boolean mask of invalid rows
    Invalid_cycles_i = df_i.loc[invalid_mask]
    df_i = df_i.loc[~invalid_mask]
    print(Invalid_cycles_i)
    print(df_i)
Your code sample leaves me a little confused, but focusing on
I want to make the same changes to all of the data frames.
and
How do I, in general, access df_1 using an iterator?
you can do exactly that by organizing your dataframes (dfs) in a dictionary (dict).
Here's how:
Assuming you've got a bunch of variables in your namespace...
# Imports
import pandas as pd
import numpy as np
# A few dataframes with random numbers
# df_1
np.random.seed(123)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['a', 'b'])
df_1 = df_1.set_index(rng)
# df_2
np.random.seed(456)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_2 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['c', 'd'])
df_2 = df_2.set_index(rng)
# df_3
np.random.seed(789)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_3 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['e', 'f'])
df_3 = df_3.set_index(rng)
...you can identify all that are dataframes using:
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
If you've got a lot of different dataframes but would only like to focus on those that have a prefix like 'df_', you can identify those by...
dfNames = []
for elem in alldfs:
    if str(elem)[:3] == 'df_':
        dfNames.append(elem)
... and then organize them in a dict using:
myFrames = {}
for dfName in dfNames:
    myFrames[dfName] = eval(dfName)
From that list of interesting dataframes, you can subset those that you'd like to do something with. Here's how you focus only on df_1 and df_2:
invalid = ['df_3']
for inv in invalid:
    myFrames.pop(inv, None)
Now you can reference ALL your valid dfs by looping through them:
for key in myFrames.keys():
    print(myFrames[key])
And that should cover the...
How do I, in general, access df_1 using an iterator?
...part of the question.
And you can of course reference a single dataframe by its name / key in the dict:
print(myFrames['df_1'])
From here you can do something with ALL columns in ALL dataframes.
for key in myFrames.keys():
    myFrames[key] = myFrames[key]*10
    print(myFrames[key])
Or, being a bit more pythonic, you can specify a lambda function and apply that to a subset of columns:
# A function
decimator = lambda x: x/10
# A subset of columns:
myCols = ['SPEED1', 'SPEED2']
Apply that function to your subset of columns in your dataframes of interest:
for key in myFrames.keys():
    for col in list(myFrames[key]):
        if col in myCols:
            myFrames[key][col] = myFrames[key][col].apply(decimator)
            print(myFrames[key][col])
So, back to your function...
modify(df_1,1)
... here's my take on it wrapped in a function.
First we'll redefine the dataframes and the function.
Oh, and with this setup, you're going to have to obtain all dfs OUTSIDE your function with alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)].
Here's the datasets and the function for an easy copy-paste:
# Imports
import pandas as pd
import numpy as np
# A few dataframes with random numbers
# df_1
np.random.seed(123)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3'])
df_1 = df_1.set_index(rng)
# df_2
np.random.seed(456)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_2 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3'])
df_2 = df_2.set_index(rng)
# df_3
np.random.seed(789)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_3 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3'])
df_3 = df_3.set_index(rng)
# A function that divides columns by 10
decimator = lambda x: x/10
# A reference to all available dataframes
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
# A function as per your request
def modify(dfs, cols, fx):
    """ Define a subset of available dataframes and a list of interesting columns,
        and apply a function on those columns.
    """
    # Subset all dataframes with names that start with df_
    dfNames = []
    for elem in alldfs:
        if str(elem)[:3] == 'df_':
            dfNames.append(elem)

    # Organize those dfs in a dict if they match the dataframe names of interest
    myFrames = {}
    for dfName in dfNames:
        if dfName in dfs:
            myFrames[dfName] = eval(dfName)
    print(myFrames)

    # Apply fx to the cols of your dfs subset
    for key in myFrames.keys():
        for col in list(myFrames[key]):
            if col in cols:
                myFrames[key][col] = myFrames[key][col].apply(fx)

# A testrun. Results below
modify(dfs=['df_1', 'df_2'], cols=['SPEED1', 'SPEED2'], fx=decimator)
(Screenshots of df_1 and df_2 before and after the manipulation are omitted here.)
Anyway, this is how I would approach it.
Hope you'll find it useful!
I face a problem with the modification of a dataframe inside a function that I have never observed before. Is there a way to deal with this so that the initial dataframe is not modified?
In[30]: def test(df):
            df['tt'] = np.nan
            return df
In[31]: dff = pd.DataFrame(data=[])
In[32]: dff
Out[32]:
Empty DataFrame
Columns: []
Index: []
In[33]: df = test(dff)
In[34]: dff
Out[34]:
Empty DataFrame
Columns: [tt]
Index: []
def test(df):
    df = df.copy(deep=True)
    df['tt'] = np.nan
    return df
If you pass the dataframe into a function, manipulate it, and return the same dataframe, you are going to get the same dataframe back in modified form. If you want to keep your old dataframe and create a new dataframe with your modifications, then by definition you have to have two dataframes: the one that you pass in, which you don't want modified, and the new one that is modified. Therefore, if you don't want to change the original dataframe, your best bet is to make a copy of it.

In my example I rebound the variable "df" in the function to the new, copied dataframe. I used the copy method, and the argument deep=True makes a copy of the dataframe and its contents. You can read more here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html
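A quick check that the copy-based version leaves the original untouched:

import numpy as np
import pandas as pd

def test(df):
    df = df.copy(deep=True)
    df['tt'] = np.nan
    return df

dff = pd.DataFrame(data=[])
out = test(dff)
print(list(dff.columns))  # [] -> the original is unchanged
print(list(out.columns))  # ['tt']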