I am seeing a dataframe get modified inside a function, which I have never observed before. Is there a way to handle this so that the initial dataframe is not modified?
In[30]: def test(df):
   ...:     df['tt'] = np.nan
   ...:     return df
In[31]: dff = pd.DataFrame(data=[])
In[32]: dff
Out[32]:
Empty DataFrame
Columns: []
Index: []
In[33]: df = test(dff)
In[34]: dff
Out[34]:
Empty DataFrame
Columns: [tt]
Index: []
def test(df):
    df = df.copy(deep=True)
    df['tt'] = np.nan
    return df
If you pass a dataframe into a function, modify it, and return it, you get back the same dataframe object, now modified. To keep the old dataframe and also get a new one with your modifications, you need two dataframes by definition: the one you pass in, which stays unmodified, and the new, modified one. So if you don't want to change the original dataframe, your best bet is to copy it first. In my example I rebind the variable "df" inside the function to the copy. The copy method with the argument "deep=True" copies the dataframe together with its contents. You can read more here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html
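To make the effect of the copy concrete, here is a minimal sketch. The function name add_column_copy and the sample column "x" are made up for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd

def add_column_copy(df):
    """Return a modified copy; the caller's frame is left untouched."""
    out = df.copy(deep=True)   # new object with its own data
    out["tt"] = np.nan         # only the copy gets the new column
    return out

original = pd.DataFrame({"x": [1, 2, 3]})
modified = add_column_copy(original)

print(list(original.columns))  # ['x']
print(list(modified.columns))  # ['x', 'tt']
print(modified is original)    # False
```

The caller now holds two separate frames, which is exactly the "two dataframes" situation described above.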
My raw data is in multiple data files that all have the same format. After importing the various (10) csv files using pd.read_csv(filename.csv), I have a series of dataframes df1, df2, df3, etc.
I want to run all of the code below on each of the dataframes.
I therefore created a function to do it:
def my_func(df):
    df = df.rename(columns=lambda x: x.strip())
    df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
    df.date = pd.to_datetime(df.date)
    # n must be passed by keyword in recent pandas versions
    df = df.join(df['long_margin'].str.split(' ', n=1, expand=True).rename(columns={0: 'A', 1: 'B'}))
    df = df.drop(columns=['long_margin'])
    df = df.drop(columns=['cash_interest'])
    mapping = {df.columns[6]: 'daily_turnover', df.columns[7]: 'cash_interest', df.columns[8]: 'long_margin', df.columns[9]: 'short_margin'}
    df = df.rename(columns=mapping)
    return df
and then tried to call the function as follows:
list_of_datasets = [df1, df2, df3]
for dataframe in list_of_datasets:
    dataframe = my_func(dataframe)
If I manually ran this code changing df to df1, df2 etc it works, but it doesn't seem to work in my function (or the way I am calling it).
What am I missing?
As I understand, in
for dataframe in list_of_datasets:
    dataframe = my_func(dataframe)
dataframe is a name that refers to an object in the list. It is not a slot in the list itself. When for x in something: is executed, Python creates a variable x and binds it to each element of the list in turn. That variable is not deleted when the loop ends, although you normally just ignore it afterwards.
If inside the function you only modify the object "by reference" (in place), that is fine. The changes will show up in the object in the list.
But as soon as the function creates a new object and binds it to the name "df" (not modifying the previous object, but creating a new one with a new ID) and then returns this new object, the assignment in the for loop simply makes dataframe refer to the new object instead of the element of the list. The element in the list is not replaced. At most, it carries whatever in-place changes the function made before it created the new DataFrame.
To see exactly when this happens, I would suggest adding print(id(df)) before and after each line of code in the function and in the loop. When the id changes, you are dealing with a new object (not with the element of the list).
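The id() check suggested above can be sketched as follows. The helper strip_columns and the one-column frame are made up for illustration; they stand in for my_func and the real data.

```python
import pandas as pd

def strip_columns(df):
    # rename() returns a new DataFrame, so `df` is rebound to a new object
    # and the frame that was passed in is never touched.
    df = df.rename(columns=lambda name: name.strip())
    return df

df1 = pd.DataFrame({" a ": [1, 2]})
frames = [df1]

for frame in frames:
    before = id(frame)
    frame = strip_columns(frame)   # rebinds the loop variable only
    after = id(frame)

print(before != after)            # True: the loop variable now names a new object
print(list(frames[0].columns))    # [' a ']  (list element unchanged)
print(list(frame.columns))        # ['a']    (only the loop variable sees the result)
```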
Alex is correct.
To make this work you could use a list comprehension:
list_of_datasets = [my_func(df) for df in list_of_datasets]
or create a new list for the outputs:
formatted_dfs = []
for dataframe in list_of_datasets:
    formatted_dfs.append(my_func(dataframe))
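If you would rather keep the same list object and replace its elements, one option is to write back by index with enumerate. This is only a sketch: my_func here is a simplified stand-in for the real function, and the sample frames are invented.

```python
import pandas as pd

def my_func(df):
    # Simplified stand-in for the real cleaning function.
    return df.rename(columns=lambda name: name.strip())

list_of_datasets = [pd.DataFrame({" a ": [1]}), pd.DataFrame({" b ": [2]})]

for i, dataframe in enumerate(list_of_datasets):
    # Assigning to list_of_datasets[i] replaces the list element itself,
    # unlike assigning to the loop variable.
    list_of_datasets[i] = my_func(dataframe)

print([list(d.columns) for d in list_of_datasets])  # [['a'], ['b']]
```

Note that in all of these variants the names df1, df2, df3 still refer to the original, unprocessed frames; the processed ones live in the list.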
I am iterating over a series of csv files as dataframes, eventually writing them all out to a common excel workbook.
In one of the many files, there are decimal GPS values (latitude, longitude) split into two columns (df[4] and df[5]) that I'm converting to degrees-minutes-seconds. That method returns a tuple that I'm attempting to park in two new fields called dmslat and dmslon in the same row of the original dataframe:
def convert_dd_to_dms(lat, lon):
    # does the math here
    return dmslat, dmslon

csv_dir = askdirectory()  # tkinter directory picker
os.chdir(csv_dir)
for f in glob.iglob("*.csv"):
    (csv_path, csv_name) = os.path.split(f)
    (csv_prefix, csv_ext) = os.path.splitext(csv_name)
    if csv_prefix[-3:] == "loc":
        df = pd.read_csv(f)
        df['dmslat'] = None
        df['dmslon'] = None
        for i, row in df.iterrows():
            fixed_coords = convert_dd_to_dms(row[4], row[5])
            row['dmslat'] = fixed_coords[0]
            row['dmslon'] = fixed_coords[1]
        print(df)
    # process the other files
So when I use a print() statement I can see the coords are properly calculated but they are not being committed to the dmslat/dmslon fields.
I have also tried assigning the new fields within the row iterator, but since I am at the row scale, it ends up overwriting the entire column with the new calculated value every time.
How can I get the results to (succinctly) populate the columns?
df.iterrows() yields each row as a new Series, which is generally a copy of the row's data rather than a view into the dataframe. So when you set "dmslat" and "dmslon" on row, you are modifying the copy, not the original dataframe. You can confirm this by printing "row" after your assignments. You will see the row item was successfully updated, but the changes are not reflected in the original dataframe.
To modify the original dataframe, you can change your code like this:
for i, row in df.iterrows():
    fixed_coords = convert_dd_to_dms(row[4], row[5])
    df.loc[i, 'dmslat'] = fixed_coords[0]
    df.loc[i, 'dmslon'] = fixed_coords[1]
print(df)
Using df.loc guarantees the changes are made to the original dataframe.
I think you would be better off using apply rather than iterrows.
Here's a solution that is based on apply. I replaced your location calculation with a function named 'foo' which does some arbitrary calculation from two fields 'a' and 'b' to new values for 'a' and 'b'.
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
def foo(row):
    return (row["a"] + row["b"], row["a"] * row["b"])
new_df = df.apply(foo, axis=1).apply(pd.Series)
In the above code block, applying 'foo' returns a tuple for every row. Using apply again with pd.Series turns it into a data frame.
df[["a", "b"]] = new_df
df.head(3)
    a   b
0  10   0
1  12  11
2  14  24
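Applied to the original question, the same apply idea can write both new columns in one step by passing result_type="expand", which turns each returned tuple into columns. This is a sketch under assumptions: to_dms is a simplified stand-in for the asker's conversion (it ignores the sign of values between -1 and 0), and the column names "lat" and "lon" are invented, since the real file addresses them by position.

```python
import pandas as pd

def to_dms(value):
    # Simplified stand-in for the real conversion (illustration only).
    degrees = int(value)
    minutes_full = abs(value - degrees) * 60
    minutes = int(minutes_full)
    seconds = round((minutes_full - minutes) * 60, 2)
    return f"{degrees} {minutes} {seconds}"

def convert_dd_to_dms(lat, lon):
    return to_dms(lat), to_dms(lon)

df = pd.DataFrame({"lat": [10.5, 45.25], "lon": [-20.25, 7.75]})

# Each row yields a (dmslat, dmslon) tuple; "expand" spreads it over two columns,
# which are then assigned to the new column names by position.
df[["dmslat", "dmslon"]] = df.apply(
    lambda row: convert_dd_to_dms(row["lat"], row["lon"]),
    axis=1,
    result_type="expand",
)
print(df)
```

Because the assignment targets df itself rather than a row yielded by iterrows, the results land in the original dataframe.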
I am attempting to create four new pandas dataframes via a list comprehension. Each new dataframe should be the original 'constituents_list' dataframe with two new columns. These two columns add a defined number of years to an existing column and return the value. The example code is below
def add_maturity(df, tenor):
    df['tenor'] = str(tenor) + 'Y'
    df['maturity'] = df['effectivedate'] + pd.DateOffset(years=tenor)
    return df
year_list = [3, 5, 7, 10]
new_dfs = [add_maturity(constituents_file, tenor) for tenor in year_list]
I expect the new_dfs list to contain four dataframes, each with a different value for 'tenor' and 'maturity'. In my results, all four dataframes have the same data, with a 'tenor' of '10Y' and a 'maturity' that is 10 years greater than the 'effectivedate' column.
I suspect that each time I iterate through the list comprehension each existing dataframe is overwritten with the latest call to the function. I just can't work out how to stop this happening.
Many thanks
When you're assigning to the DataFrame object, you're modifying in place. And when you pass it as an argument to a function, what you're passing is a reference to the DataFrame object, in this case a reference to the same DataFrame object every time, so that's overwriting the previous results.
To solve this issue, you can either create a copy of the DataFrame at the start of the function:
def add_maturity(df, tenor):
    df = df.copy()
    df['tenor'] = str(tenor) + 'Y'
    df['maturity'] = df['effectivedate'] + pd.DateOffset(years=tenor)
    return df
(Or you could keep the function as is, and have the caller copy the DataFrame first when passing it as an argument...)
Or you can use the assign() method, which returns a new DataFrame with the modified columns:
def add_maturity(df, tenor):
    return df.assign(
        tenor=str(tenor) + 'Y',
        maturity=df['effectivedate'] + pd.DateOffset(years=tenor),
    )
(Personally, I'd go with the latter. It's similar to how most DataFrame methods work, in that they typically return a new DataFrame rather than modifying it in place.)
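As a quick check of the assign() version, the sketch below runs it over the four tenors. The two sample dates and the variable name constituents are invented for illustration.

```python
import pandas as pd

def add_maturity(df, tenor):
    # assign() returns a new DataFrame and leaves `df` untouched.
    return df.assign(
        tenor=str(tenor) + "Y",
        maturity=df["effectivedate"] + pd.DateOffset(years=tenor),
    )

constituents = pd.DataFrame(
    {"effectivedate": pd.to_datetime(["2020-01-15", "2021-06-30"])}
)
year_list = [3, 5, 7, 10]
new_dfs = [add_maturity(constituents, tenor) for tenor in year_list]

print([frame["tenor"].iloc[0] for frame in new_dfs])  # ['3Y', '5Y', '7Y', '10Y']
print(list(constituents.columns))                     # ['effectivedate']
```

Each element of new_dfs is a distinct object, and the input frame never gains the extra columns.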
I tried writing a function that creates dummies and concatenates the original df with the dummies df. When a dataframe is passed through the function, I don't see any changes in df!
def get_dummies(df, col):
    colLabel = pd.get_dummies(df[col])
    df = pd.concat([df, colLabel], axis=1)
get_dummies(train_set1, 'jobtype')
train_set1 won't change!
You need the function to return the frame and assign it back:
def get_dummies(df, col):
    colLabel = pd.get_dummies(df[col])
    df = pd.concat([df, colLabel], axis=1)
    return df
train_set1 = get_dummies(train_set1, 'jobtype')
If you absolutely insist on doing it the way you've asked, you could potentially assign the DataFrame a __name__ attribute and update the frame in the globals() dict of variables (definitely not advised though!):
def get_dummies(df, col):
    colLabel = pd.get_dummies(df[col])
    new_df = pd.concat([df, colLabel], axis=1)
    globals()[df.__name__] = new_df
train_set1.__name__ = 'train_set1'
get_dummies(train_set1, 'jobtype')
concat returns a new object, so the operation is not in place.
A number of pandas methods have an "inplace" argument. Set it to True if you want to modify the object itself rather than get a new one.
concat does not have such an argument. It does have a "copy" argument, but copy=False only avoids copying the underlying data unnecessarily. The result is still a new DataFrame, so you still need to return it from the function.
Pandas' pandas.concat function by default copies the data frame on concatenation. Essentially, this generates a new data frame which is stored in your local df variable and replaces the reference to the original data frame passed from outside. As a consequence, upon assignment to df, you don't change the original data frame anymore but replace it with a new one only inside the function.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
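Note that setting copy=False in your call does not modify the data frame in place. It only tells pandas to avoid copying the underlying data where it can. concat still returns a new data frame, and assigning it to df inside the function only rebinds the local name, so the new frame never leaves the function scope unless you return it.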
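The difference between rebinding and mutating can be sketched as follows. The helper add_dummies_inplace is hypothetical (it is not a pandas function) and the three-row frame is invented; it shows one way to change the caller's frame, by assigning columns to it directly.

```python
import pandas as pd

def add_dummies_inplace(df, col):
    # Hypothetical helper: column assignment mutates the object that was
    # passed in, so the caller sees the new columns without a return value.
    dummies = pd.get_dummies(df[col])
    for name in dummies.columns:
        df[name] = dummies[name]

train_set1 = pd.DataFrame({"jobtype": ["ceo", "janitor", "ceo"]})

# concat always builds a new object; the input frame is unchanged.
combined = pd.concat([train_set1, pd.get_dummies(train_set1["jobtype"])], axis=1)
print(combined is train_set1)       # False
print(list(train_set1.columns))     # ['jobtype']

add_dummies_inplace(train_set1, "jobtype")
print(list(train_set1.columns))     # ['jobtype', 'ceo', 'janitor']
```

Returning the new frame and assigning it back, as in the earlier answer, is usually the clearer choice; the in-place variant is shown only to illustrate the distinction.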
I have pandas DataFrame df with different types of columns, some values of df are NaN.
To test an assumption, I create a copy of df and transform the copy to (0, 1) with pandas.isnull():
df_copy = df
for column in df_copy:
    df_copy[column] = df_copy[column].isnull().astype(int)
but after that BOTH df and df_copy consist of 0 and 1.
Why does this code transform df to 0 and 1, and is there a way to prevent it?
You can prevent it by writing:
df_copy = df.copy()
This creates a new object. Prior to that you essentially had two pointers to the same object. You also might want to check this answer and note that DataFrames are mutable.
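The difference between the two assignments can be checked with the is operator. This is a small sketch with invented sample data and variable names (alias, independent).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})

alias = df                # second name for the same object
independent = df.copy()   # new object with its own data

print(alias is df)        # True
print(independent is df)  # False

# Changing the copy leaves the original values alone.
independent["a"] = independent["a"].isnull().astype(int)
print(df["a"].isnull().tolist())   # [False, True]
print(independent["a"].tolist())   # [0, 1]
```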
Btw, you could obtain the desired result simply with:
df_copy = df.isnull().astype(int)
Or, even better memory-wise, add flag columns to the original frame:
for column in df:
    df[column + 'flag'] = df[column].isnull().astype(int)