I have a for loop in which I want to access a different pd.DataFrame on each iteration and add a certain column ('feedin') to another dataframe. The variable name consists of 'feedin_' + x, where x is, let's say, a, b or c. So in the first iteration I want to access the variable feedin_a and add its column 'feedin' to the new dataframe, in the next iteration feedin_b, and so on.
I pass a list ['a', 'b', 'c'] and try to combine 'feedin_' + name. But since the list consists of strings, the result is just a string and does not refer to the variable.
feedin_a = pd.DataFrame()
feedin_b = pd.DataFrame()
feedin_c = pd.DataFrame()
list = ['a', 'b', 'c']
for name in list:
    df.new['feedin_'+name] = 'feedin_'+name['feedin']
This doesn't work because 'feedin_'+name is just a string, not the variable. I hope my problem is clear.
This is one of the reasons it's not a great idea to use variables feedin_a, feedin_b, when you really need a data structure like a list.
If you can't change this, you can look up the names in the locals() (or globals()) dictionary:
df.new['feedin_'+name] = locals()['feedin_'+name]['feedin']
You can use locals(), so it should be something like this:
feedin_a = pd.DataFrame()
feedin_b = pd.DataFrame()
feedin_c = pd.DataFrame()
name_list = ['a', 'b', 'c']
for name in name_list:
    key = 'feedin_{}'.format(name)
    df.new[key] = locals()[key]['feedin']
Also see the related question "Python, using two variables in getattr?".
P.S. Do not use list as a variable name, because it shadows the built-in Python type.
Others have answered your exact question by using locals(). In the comments someone pointed out another way to achieve what you're after: using a dictionary whose keys are your "variable names" as strings, so they can be looked up with string manipulation.
import pandas as pd
dataframes = dict()
letter_list = ['a', 'b', 'c']
for letter in letter_list:
    dataframes['feedin_'+letter] = pd.DataFrame()
for name in letter_list:
    dataframes['feedin_'+name].insert(0, 'feedin_'+name, None)
print(dataframes)
{'feedin_a': Empty DataFrame
Columns: [feedin_a]
Index: [], 'feedin_b': Empty DataFrame
Columns: [feedin_b]
Index: [], 'feedin_c': Empty DataFrame
Columns: [feedin_c]
Index: []}
If you intend to do a lot of calling your DataFrames by string manipulation, this is a good way to go.
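To tie this back to the original goal (collecting the 'feedin' column of each frame into one new dataframe), here is a minimal sketch. The sample values and the name df_new are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for feedin_a, feedin_b and feedin_c.
dataframes = {
    'feedin_a': pd.DataFrame({'feedin': [1.0, 2.0]}),
    'feedin_b': pd.DataFrame({'feedin': [3.0, 4.0]}),
    'feedin_c': pd.DataFrame({'feedin': [5.0, 6.0]}),
}

# Build the new frame with one column per source frame, named after its key.
df_new = pd.DataFrame(
    {key: frame['feedin'] for key, frame in dataframes.items()}
)

print(df_new)
```

Because the frames live in a dictionary, no locals() lookup is needed: the string key is the only "name" each frame has.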
Related
I have 4 different dataframes containing time series data that all have the same structure.
My goal is to take each individual dataframe and pass it through a function I have defined that will group them by datestamp, sum the columns and return a new dataframe with the columns I want. So in total I want 4 new dataframes that have only the data I want.
I just looked through this post:
Loop through different dataframes and perform actions using a function
but applying this did not change my results.
Here is my code:
I am putting the dataframes in a list so I can iterate through them
dfs = [vds, vds2, vds3, vds4]
This is my function I want to pass each dataframe through:
def VDS_pre(df):
    df = df.groupby(['datestamp','timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date','timestamp':'Time','det_vol': 'VolumeVDS'})
    df = df[['Date','Time','VolumeVDS']]
    return df
This is the loop I made to iterate through my dataframe list and pass each one through my function:
for df in dfs:
    df = VDS_pre(df)
However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did. Thanks for the help!
"However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did."
Yes, that is expected. The reason they have not been modified is:
Assigning to the loop variable inside a for item in lst: loop only rebinds that name. It has no effect on lst or on the variables from which the list items got their values, as the following code demonstrates:
v1=1; v2=2; v3=3
lst = [v1,v2,v3]
for item in lst:
    item = 0
print(lst, v1, v2, v3) # gives: [1, 2, 3] 1 2 3
To achieve the result you expect to obtain you can use a list comprehension and the list unpacking feature of Python:
vds,vds2,vds3,vds4=[VDS_pre(df) for df in [vds,vds2,vds3,vds4]]
or the following code, which uses a list of strings holding the identifier/variable names of the dataframes:
sdfs = ['vds', 'vds2', 'vds3', 'vds4']
for sdf in sdfs:
    exec(f'{sdf} = VDS_pre(eval(sdf))')
Now printing vds, vds2, vds3 and vds4 will output the modified dataframes.
Pandas operations such as groupby return a new object rather than modifying the data in place. Your snippet stores the result in the loop variable df, which is never written back to your initial list. This is why nothing appears to have changed after the loop runs.
If you don't need to keep original frames, you may simply overwrite them:
for i, df in enumerate(dfs):
    dfs[i] = VDS_pre(df)
If you do need them, just use a second list and append each result to it.
l = []
for df in dfs:
    df2 = VDS_pre(df)
    l.append(df2)
Or, even better, use a list comprehension to rewrite this snippet as a single line of code.
Now you are able to store the result of your processing.
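The list-comprehension version could look like the sketch below. VDS_pre is the function from the question; the two small frames and the name processed are made-up stand-ins so the example runs on its own.

```python
import pandas as pd

def VDS_pre(df):
    # Same steps as in the question: group, sum, rename, select.
    df = df.groupby(['datestamp', 'timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time',
                            'det_vol': 'VolumeVDS'})
    return df[['Date', 'Time', 'VolumeVDS']]

# Hypothetical sample frames with the same structure.
vds = pd.DataFrame({'datestamp': ['2020-01-01', '2020-01-01'],
                    'timestamp': ['00:00', '00:00'],
                    'det_vol': [1, 2]})
vds2 = pd.DataFrame({'datestamp': ['2020-01-01', '2020-01-01'],
                     'timestamp': ['00:00', '00:00'],
                     'det_vol': [10, 20]})
dfs = [vds, vds2]

# One line: a new list holding the processed frames; the originals are untouched.
processed = [VDS_pre(df) for df in dfs]

print(processed[0])
```

If you want the results back in the list you started with, assign to dfs instead of processed.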
Additionally, if your frames have the same structure and can be merged into a single frame, you may consider concatenating them first and then applying your function to the result. That would be the most idiomatic pandas approach.
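A rough sketch of the concatenate-first idea follows. It assumes you still want to tell the frames apart afterwards, so it adds a 'source' column (a name invented here) and groups on it as well; without that column the sums would be taken across all frames together. The sample data is made up.

```python
import pandas as pd

# Hypothetical sample frames with the same structure.
frames = {
    'vds': pd.DataFrame({'datestamp': ['2020-01-01', '2020-01-01'],
                         'timestamp': ['00:00', '00:00'],
                         'det_vol': [1, 2]}),
    'vds2': pd.DataFrame({'datestamp': ['2020-01-01', '2020-01-01'],
                          'timestamp': ['00:00', '00:00'],
                          'det_vol': [10, 20]}),
}

# Tag each frame with its name, then stack them into one frame.
combined = pd.concat(
    [df.assign(source=name) for name, df in frames.items()],
    ignore_index=True,
)

# A single groupby now handles every source at once.
result = (
    combined
    .groupby(['source', 'datestamp', 'timestamp'], as_index=False)['det_vol']
    .sum()
    .rename(columns={'datestamp': 'Date', 'timestamp': 'Time',
                     'det_vol': 'VolumeVDS'})
)

print(result)
```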
I am attempting to create four new pandas dataframes via a list comprehension. Each new dataframe should be the original 'constituents_file' dataframe with two new columns. These two columns add a defined number of years to an existing column and return the value. The example code is below.
def add_maturity(df, tenor):
    df['tenor'] = str(tenor) + 'Y'
    df['maturity'] = df['effectivedate'] + pd.DateOffset(years=tenor)
    return df
year_list = [3, 5, 7, 10]
new_dfs = [add_maturity(constituents_file, tenor) for tenor in year_list]
My expected output in the new_dfs list is four dataframes, each with a different value for 'tenor' and 'maturity'. In my results, all four dataframes have the same data, with a 'tenor' of '10Y' and a 'maturity' that is 10 years greater than the 'effectivedate' column.
I suspect that each time I iterate through the list comprehension each existing dataframe is overwritten with the latest call to the function. I just can't work out how to stop this happening.
Many thanks
When you're assigning to the DataFrame object, you're modifying in place. And when you pass it as an argument to a function, what you're passing is a reference to the DataFrame object, in this case a reference to the same DataFrame object every time, so that's overwriting the previous results.
To solve this issue, you can either create a copy of the DataFrame at the start of the function:
def add_maturity(df, tenor):
    df = df.copy()
    df['tenor'] = str(tenor) + 'Y'
    df['maturity'] = df['effectivedate'] + pd.DateOffset(years=tenor)
    return df
(Or you could keep the function as is, and have the caller copy the DataFrame first when passing it as an argument...)
Or you can use the assign() method, which returns a new DataFrame with the modified columns:
def add_maturity(df, tenor):
    return df.assign(
        tenor=str(tenor) + 'Y',
        maturity=df['effectivedate'] + pd.DateOffset(years=tenor),
    )
(Personally, I'd go with the latter. It's similar to how most DataFrame methods work, in that they typically return a new DataFrame rather than modifying it in place.)
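As a quick check, here is the assign() version run against a small made-up frame (the two dates are illustrative only). Each call yields its own DataFrame and the original is left alone.

```python
import pandas as pd

def add_maturity(df, tenor):
    # assign() returns a new DataFrame; df itself is not modified.
    return df.assign(
        tenor=str(tenor) + 'Y',
        maturity=df['effectivedate'] + pd.DateOffset(years=tenor),
    )

# Hypothetical sample data standing in for the real constituents_file.
constituents_file = pd.DataFrame(
    {'effectivedate': pd.to_datetime(['2020-01-15', '2021-06-30'])}
)

year_list = [3, 5, 7, 10]
new_dfs = [add_maturity(constituents_file, tenor) for tenor in year_list]

for df in new_dfs:
    print(df)
```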
I'm trying to perform a number of operations on a list of dataframes. I've opted to use a dictionary to help me with this process, but I was wondering if it's possible to reference the originally created dataframe with the changes.
So using the below code as an example, is it possible to call the dfA object with the columns ['a', 'b', 'c'] that were added when it was nested within the dictionary object?
dfA = pd.DataFrame(data=[1], columns=['x'])
dfB = pd.DataFrame(data=[1], columns=['y'])
dfC = pd.DataFrame(data=[1], columns=['z'])
dfdict = {'A':dfA,
'B':dfB,
'C':dfC}
df_dummy = pd.DataFrame(data=[[1,2,3]], columns=['a', 'b', 'c'])
for key in dfdict:
    dfdict[str(key)] = pd.concat([dfdict[str(key)], df_dummy], axis=1)
The initial dfA that you created and the dfA DataFrame from your dictionary are two different objects. (You can confirm this by running dfA is dfdict['A'] or id(dfA) == id(dfdict['A']), both of which should return False).
To access the second (newly created) object you need to call it from the dictionary.
dfdict['A']
Or:
dfdict.get('A')
The returned DataFrame will have the new columns you added.
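The sketch below walks through that identity change using the frames from the question, cut down to one entry. The names same_before and same_after are invented for illustration.

```python
import pandas as pd

dfA = pd.DataFrame(data=[1], columns=['x'])
dfdict = {'A': dfA}
df_dummy = pd.DataFrame(data=[[1, 2, 3]], columns=['a', 'b', 'c'])

# Before the loop body runs, the name and the dictionary entry share one object.
same_before = dfA is dfdict['A']

# pd.concat builds a new DataFrame, and the dictionary entry is rebound to it.
dfdict['A'] = pd.concat([dfdict['A'], df_dummy], axis=1)
same_after = dfA is dfdict['A']

print(same_before, same_after)

# If you want the name dfA to refer to the updated frame, rebind it explicitly.
dfA = dfdict['A']
print(list(dfA.columns))
```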
I have a DataFrame like:
df = pd.DataFrame({'A' : (1,2,3), 'B': ([0,1],[3,4],[6,0])})
I want to create a new column called test that displays a 1 if a 0 exists within each list in column B. The results hopefully would look like:
df = pd.DataFrame({'A' : (1,2,3), 'B': ([0,1],[3,4],[6,0]), 'test': (1,0,1)})
With a dataframe that contains strings rather than lists I have created additional columns from the string values using the following
df.loc[df['B'].str.contains(0),'test']=1
When I try this with the example df, I generate a
TypeError: first argument must be string or compiled pattern
I also tried to convert the list to a string but only retained the first integer in the list. I suspect that I need something else to access the elements within the list but don't know how to do it. Any suggestions?
This should do it for you:
import numpy as np
df['test'] = np.where(df['B'].apply(lambda x: 0 in x), 1, 0)
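A variant that stays within pandas is sketched below. It converts the membership test straight to an integer inside apply; the parameter name values is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'A': (1, 2, 3), 'B': ([0, 1], [3, 4], [6, 0])})

# For each list in column B, check whether it contains 0 and store 1 or 0.
df['test'] = df['B'].apply(lambda values: int(0 in values))

print(df)
```

The .str accessor is meant for strings, so it is not suited to testing list membership; apply runs an ordinary Python check on each list.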
I am using Pandas to select columns from a dataframe, olddf. Let's say the variable names are 'a', 'b', 'c', 'startswith1', 'startswith2', 'startswith3', ..., 'startswith10'.
My approach was to create a list of all variables with a common starting value.
filter_col = [col for col in list(olddf) if col.startswith('startswith')]
I'd like to then select columns within that list as well as others, by name, so I don't have to type them all out. However, this doesn't work:
newdf = olddf['a','b',filter_col]
And this doesn't either:
newdf = olddf[['a','b'],filter_col]
I'm a newbie so this is probably pretty simple. Is the reason this doesn't work because I'm mixing a list improperly?
Thanks.
Use
newdf = olddf[['a','b']+filter_col]
since adding lists concatenates them:
In [264]: ['a', 'b'] + ['startswith1']
Out[264]: ['a', 'b', 'startswith1']
Alternatively, you could use the filter method:
newdf = olddf.filter(regex=r'^(startswith|[ab]$)')
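Both approaches can be compared on a small made-up frame, as in the sketch below. The column values and the names by_list and by_regex are illustrative only; the regex here anchors 'a' and 'b' with $ so that other columns starting with those letters are not picked up.

```python
import pandas as pd

# Hypothetical frame with the column names from the question (shortened).
olddf = pd.DataFrame(
    [[1, 2, 3, 4, 5]],
    columns=['a', 'b', 'c', 'startswith1', 'startswith2'],
)

filter_col = [col for col in olddf.columns if col.startswith('startswith')]

# Approach 1: concatenate the two lists of column names.
by_list = olddf[['a', 'b'] + filter_col]

# Approach 2: select columns whose names match a regular expression.
by_regex = olddf.filter(regex=r'^(startswith|[ab]$)')

print(list(by_list.columns))
print(list(by_regex.columns))
```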