I am using multiprocessing to run some pretty time-consuming tasks concurrently, creating a number of separate dataframes which I will merge into one later, like so:
from multiprocessing import Manager, Process
from pprint import pprint

manager = Manager()
ns = manager.Namespace()
ns.df_one = data_format.init()  # this just creates a new dataframe with predefined columns
ns.df_two = data_format.init()  # this just creates a new dataframe with predefined columns

p_search_one = Process(target=search_function_one, args=(ns,))
p_search_one.start()
p_search_two = Process(target=search_function_one, args=(ns,))
p_search_two.start()

p_search_one.join()
p_search_two.join()

pprint(f'search_one: {ns.df_one}')  # prints an empty dataframe
Then, in the function I am modifying the dataframe:
def search_function_one(ns):
    df = ns.df_one
    do_some_magic(df)  # just adds rows to the dataframe, not returning anything, just modifies in place
    pprint(f'df from ns: {ns.df_one}')  # prints an empty dataframe
    pprint(f'df from ns: {df}')  # prints an empty dataframe
I have also tried not making df a copy (?) of ns.df_one like so:
def search_function_one(ns):
    do_some_magic(ns.df_one)  # just adds rows to the dataframe, not returning anything, just modifies in place
    pprint(f'df from ns: {ns.df_one}')  # prints an empty dataframe
But that just prints an empty dataframe. Without using concurrency this works as expected, modifying the dataframe in place, but with concurrency it doesn't work.
I'm also wondering whether do_some_magic is the issue, as it's in another file, but it doesn't make a new reference to the input parameter; it doesn't do df = input_var, it just accesses the input variable directly.
Am I doing something fundamentally wrong in how I'm managing my datatypes?
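For reference, one likely explanation (a sketch of the usual fix, not verified against this exact setup): a Manager Namespace only moves data between processes on attribute access, so reading ns.df_one inside the worker hands you a pickled copy, and mutating that copy in place never reaches the manager process. The attribute has to be reassigned for the change to propagate:

def search_function_one(ns):
    df = ns.df_one     # a pickled copy, not a live reference
    do_some_magic(df)  # mutates only the local copy
    ns.df_one = df     # reassigning the attribute sends the data back to the manager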
My raw data is in multiple data files that share the same format. After importing the ten csv files using pd.read_csv('filename.csv'), I have a series of dataframes df1, df2, df3, etc.
I want to perform all of the below code to each of the dataframes.
I therefore created a function to do it:
def my_func(df):
    df = df.rename(columns=lambda x: x.strip())
    df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
    df.date = pd.to_datetime(df.date)
    df = df.join(df['long_margin'].str.split(' ', n=1, expand=True).rename(columns={0: 'A', 1: 'B'}))
    df = df.drop(columns=['long_margin'])
    df = df.drop(columns=['cash_interest'])
    mapping = {df.columns[6]: 'daily_turnover', df.columns[7]: 'cash_interest',
               df.columns[8]: 'long_margin', df.columns[9]: 'short_margin'}
    df = df.rename(columns=mapping)
    return df
and then tried to call the function as follows:
list_of_datasets = [df1, df2, df3]
for dataframe in list_of_datasets:
    dataframe = my_func(dataframe)
If I manually run this code, changing df to df1, df2, etc., it works, but it doesn't seem to work through my function (or the way I am calling it).
What am I missing?
As I understand it, in

for dataframe in list_of_datasets:
    dataframe = my_func(dataframe)
dataframe is a reference to an object in the list; it is not the DataFrame itself. When for x in something: is executed, Python creates a new variable x that points to an element of the list on each iteration. That variable is not deleted when the loop ends, but you usually just discard it.
If inside the function you only modify this object in place, that's fine: the changes will propagate to the object in the list.
But as soon as the function creates a new object named df (a new object with a new id, rather than a modification of the previous one) and returns it, the assignment in the for loop merely makes dataframe point to that new object instead of the element of the list. The element in the list is only affected up to the point where the function first created a new DataFrame instead of modifying the existing one.
To see exactly where that happens, I would suggest adding print(id(df)) before and after each line of code in the function and in the loop. When the id changes, you are dealing with a new object, not with the element of the list.
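For example, a minimal illustration of that id check (reusing the rename step from the function above):

df = list_of_datasets[0]
print(id(df))                                # the id of the element in the list
df = df.rename(columns=lambda x: x.strip())  # rename returns a new DataFrame
print(id(df))                                # a different id: df now points to a new object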
Alex is correct.
To make this work you could use list comprehension:
list_of_datasets = [my_func(df) for df in list_of_datasets]
or create a new list for the outputs:

formatted_dfs = []
for dataframe in list_of_datasets:
    formatted_dfs.append(my_func(dataframe))
I have a process which I can loop through for values held in a list, but it overwrites the final dataframe on each pass, and I would like to append or concat the result of each pass into one dataframe. In the example below, 'dataframe' initially holds the result for 'blah1'; by the time the process finishes it holds only the result for 'blah2'.
listtoloop = ['blah1', 'blah2']
for name in listtoloop:
    # some process happens here resulting in
    # dataframe = result of above process
The typical pattern used for this is to create a list of DataFrames, and only at the end of the loop, concatenate them into a single DataFrame. This is usually much faster than appending new rows to the DataFrame after each step, as you are not constructing a new DataFrame on every iteration.
Something like this should work:
import pandas as pd

listtoloop = ['blah1', 'blah2']
dfs = []
for name in listtoloop:
    # some process happens here resulting in
    # dataframe = result of above process
    dfs.append(dataframe)

final = pd.concat(dfs, ignore_index=True)
Put your results in a list and then append the list to the df as a new row, making sure the list is in the same order as the df's columns:
listtoloop = ['blah1', 'blah2']
df = pd.DataFrame(columns=["A", "B"])
for name in listtoloop:
    ## processes here
    to_append = [5, 6]
    df_length = len(df)
    df.loc[df_length] = to_append
data_you_need = pd.DataFrame()
listtoloop = ['blah1', 'blah2']
for name in listtoloop:
    ## some process happens here resulting in
    ## dataframe = result of above process
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    data_you_need = pd.concat([data_you_need, dataframe], ignore_index=True)
I have some keys to clean. Each dirty key has six leading zeros that I want to get rid of, and if the key does not end with "ABC" or "DEFG", I also need to strip the three-character currency code from its end. If the key doesn't start with leading zeros, it should just be returned as it is.
To achieve this I wrote a function that deals with the string as below:
def cleanAttainKey(dirtyAttainKey):
    if dirtyAttainKey[0] != "0":
        return dirtyAttainKey
    else:
        dirtyAttainKey = dirtyAttainKey.lstrip("0")  # remove only the leading zeros
        if not dirtyAttainKey.endswith("ABC") and not dirtyAttainKey.endswith("DEFG"):
            dirtyAttainKey = dirtyAttainKey[:-3]  # drop the 3-character currency code
        return dirtyAttainKey
Now I build a dummy data frame to test it, but it's reporting errors:
df = pd.DataFrame({'dirtyKey': ["00000012345ABC", "0000012345DEFG", "0000023456DEFGUSD"],
                   'amount': [100, 101, 102]},
                  columns=["dirtyKey", "amount"])
I need to get a new column called "cleanAttainKey" in the df: each value from "dirtyKey" should be cleaned with the cleanAttainKey function and assigned to the new column. However, pandas doesn't seem to support this type of modification.
# add a new column in df called cleanAttainKey
df['cleanAttainKey'] = ""

# I want to clean the keys and get them into the new column of cleanAttainKey
dirtyAttainKeyList = df['dirtyKey'].tolist()
for i in range(len(df['cleanAttainKey'])):
    df['cleanAttainKey'][i] = cleanAttainKey(dirtyAttainKeyList[i])
I am getting the below error message:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The result should be the same as the df2 below:
df2 = pd.DataFrame({'dirtyKey': ["00000012345ABC", "0000012345DEFG", "0000023456DEFGUSD"],
                    'amount': [100, 101, 102],
                    'cleanAttainKey': ["12345ABC", "12345DEFG", "23456DEFG"]},
                   columns=["dirtyKey", "cleanAttainKey", "amount"])
df2
Is there any better way to modify the dirty keys and get a new column with the clean keys in Pandas?
Thanks
Here is the culprit:
df['cleanAttainKey'][i] = cleanAttainKey(dirtyAttainKeyList[i])
When you take a subset of a dataframe, pandas reserves the right to return either a copy or a view. That does not matter if you are just reading the data, but it means you should never write through such a subset (this is the chained assignment the warning refers to).
The idiomatic way is to use loc (or iloc, or at/iat):
df.loc[i, 'cleanAttainKey'] = cleanAttainKey(dirtyAttainKeyList[i])
(The above assumes a default RangeIndex...)
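As an aside, a more idiomatic way to build the column (a sketch, reusing the cleanAttainKey function above) is to skip the index loop entirely and apply the function over the column, which sidesteps the chained assignment altogether:

df['cleanAttainKey'] = df['dirtyKey'].apply(cleanAttainKey)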
I am attempting to create four new pandas dataframes via a list comprehension. Each new dataframe should be the original 'constituents_file' dataframe with two new columns. These two columns add a defined number of years to an existing date column and hold the result. The example code is below.
def add_maturity(df, tenor):
    df['tenor'] = str(tenor) + 'Y'
    df['maturity'] = df['effectivedate'] + pd.DateOffset(years=tenor)
    return df

year_list = [3, 5, 7, 10]
new_dfs = [add_maturity(constituents_file, tenor) for tenor in year_list]
My expected output in the new_dfs list is four dataframes, each with a different value for 'tenor' and 'maturity'. In my results, all four dataframes have the same data, with a 'tenor' of '10Y' and a 'maturity' that is 10 years greater than the 'effectivedate' column.
I suspect that on each iteration of the list comprehension, the existing dataframe is overwritten by the latest call to the function. I just can't work out how to stop this happening.
Many thanks
When you assign to columns of the DataFrame, you're modifying it in place. And when you pass it as an argument to a function, what you're passing is a reference to the DataFrame object, in this case a reference to the same DataFrame object every time, so each call overwrites the previous results.
To solve this issue, you can either create a copy of the DataFrame at the start of the function:
def add_maturity(df, tenor):
    df = df.copy()
    df['tenor'] = str(tenor) + 'Y'
    df['maturity'] = df['effectivedate'] + pd.DateOffset(years=tenor)
    return df
(Or you could keep the function as is, and have the caller copy the DataFrame first when passing it as an argument...)
Or you can use the assign() method, which returns a new DataFrame with the modified columns:
def add_maturity(df, tenor):
    return df.assign(
        tenor=str(tenor) + 'Y',
        maturity=df['effectivedate'] + pd.DateOffset(years=tenor),
    )
(Personally, I'd go with the latter. It's similar to how most DataFrame methods work, in that they typically return a new DataFrame rather than modifying it in place.)
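A quick check that either version now produces distinct frames (using hypothetical sample data with just an 'effectivedate' column):

constituents_file = pd.DataFrame({'effectivedate': pd.to_datetime(['2020-01-01', '2020-06-30'])})
new_dfs = [add_maturity(constituents_file, tenor) for tenor in [3, 5, 7, 10]]
print([d['tenor'].iloc[0] for d in new_dfs])  # ['3Y', '5Y', '7Y', '10Y'], one per frame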
Why would the following code not affect the Output DataFrame? (This example is not interesting in itself - it is a convoluted way of 'copying' a DataFrame.)
def getRow(row):
    Output.append(row)

Output = pd.DataFrame()
Input = pd.read_csv('Input.csv')
Input.apply(getRow)
Is there a way of obtaining such functionality with the apply function, so that it affects other variables?
What happens
DataFrame.append() returns a new dataframe. It does not modify Output but rather creates a new one every time. (DataFrame.append was deprecated and then removed in pandas 2.0; pd.concat is the modern replacement, and it likewise returns a new object.)
DataFrame.append(self, other, ignore_index=False, verify_integrity=False)
Append rows of other to the end of this frame, returning a new
object. Columns not in this frame are added as new columns.
Here:
Output.append(row)
you create a new dataframe but throw it away immediately.
You have access via global, but you shouldn't use it this way
While this works, I strongly recommend against using global:
df = pd.DataFrame([1, 2, 3])
df2 = pd.DataFrame()

def get_row(row):
    global df2
    df2 = df2.append(row)  # append returns a new frame, so df2 must be rebound

df.apply(get_row)
print(df2)
Output:
   0  1  2
0  1  2  3
Take it as a demonstration of what happens; don't use it in your code.
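If you really want to collect data via apply, a variant that avoids global is to accumulate rows into a plain list (list.append mutates the list in place, so no rebinding is needed) and build the DataFrame at the end; a sketch:

import pandas as pd

rows = []
def get_row(row):
    rows.append(row)          # mutates the existing list; no global needed

Input = pd.read_csv('Input.csv')
Input.apply(get_row, axis=1)  # axis=1 passes one row (a Series) per call
Output = pd.DataFrame(rows)   # reassembles the rows into a copy of Input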