I have 4 different dataframes containing time series data that all have the same structure.
My goal is to take each individual dataframe and pass it through a function I have defined that will group them by datestamp, sum the columns and return a new dataframe with the columns I want. So in total I want 4 new dataframes that have only the data I want.
I just looked through this post:
Loop through different dataframes and perform actions using a function
but applying this did not change my results.
Here is my code:
I am putting the dataframes in a list so I can iterate through them
dfs = [vds, vds2, vds3, vds4]
This is my function I want to pass each dataframe through:
def VDS_pre(df):
    df = df.groupby(['datestamp', 'timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time', 'det_vol': 'VolumeVDS'})
    df = df[['Date', 'Time', 'VolumeVDS']]
    return df
This is the loop I made to iterate through my dataframe list and pass each one through my function:
for df in dfs:
    df = VDS_pre(df)
However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did. Thanks for the help!
However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did.
Yes, this is actually the case. The reason why they have not been modified is:
Rebinding the loop variable in a for item in lst: loop affects neither lst nor the variables from which the list items got their values, as the following code demonstrates:
v1 = 1; v2 = 2; v3 = 3
lst = [v1, v2, v3]
for item in lst:
    item = 0
print(lst, v1, v2, v3)  # gives: [1, 2, 3] 1 2 3
To achieve the result you expect to obtain you can use a list comprehension and the list unpacking feature of Python:
vds,vds2,vds3,vds4=[VDS_pre(df) for df in [vds,vds2,vds3,vds4]]
or the following code, which uses a list of strings holding the identifier/variable names of the dataframes (note that exec and eval should be used with care, and only on trusted input):
sdfs = ['vds', 'vds2', 'vds3', 'vds4']
for sdf in sdfs:
    exec(f'{sdf} = VDS_pre(eval(sdf))')
Now printing vds, vds2, vds3 and vds4 will output the modified dataframes.
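An exec-free alternative is to keep the dataframes in a dict keyed by their names and rebuild it with a dict comprehension. A sketch, with hypothetical one-row frames standing in for vds through vds4:

```python
import pandas as pd

def VDS_pre(df):
    df = df.groupby(['datestamp', 'timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time',
                            'det_vol': 'VolumeVDS'})
    return df[['Date', 'Time', 'VolumeVDS']]

# Hypothetical stand-ins for vds, vds2, vds3, vds4
frames = {name: pd.DataFrame({'datestamp': ['2020-01-01'],
                              'timestamp': ['00:00'],
                              'det_vol': [10]})
          for name in ['vds', 'vds2', 'vds3', 'vds4']}

# Rebuild the dict with every frame processed
frames = {name: VDS_pre(df) for name, df in frames.items()}
print(frames['vds'].columns.tolist())  # ['Date', 'Time', 'VolumeVDS']
```

Accessing frames['vds'] instead of a bare vds variable keeps all four results addressable by name without touching the caller's namespace.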
Pandas frame operations return a new copy of the data. Your snippet stores the result in the df variable, which is never written back into your initial list. This is why no result is stored after execution.
If you don't need to keep original frames, you may simply overwrite them:
for i, df in enumerate(dfs):
    dfs[i] = VDS_pre(df)
If you do need the originals, use a second list and append each result to it:
l = []
for df in dfs:
    df2 = VDS_pre(df)
    l.append(df2)
Or, better, use a list comprehension to rewrite this snippet as a single line of code.
Now you are able to store the result of your processing.
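As a sketch (with a trivial stand-in for VDS_pre), the whole enumerate loop collapses to one list comprehension:

```python
import pandas as pd

def VDS_pre(df):
    # trivial stand-in for the real processing function
    return df * 2

dfs = [pd.DataFrame({'det_vol': [i]}) for i in range(4)]

# One line replaces the loop: the list now holds the processed frames
dfs = [VDS_pre(df) for df in dfs]
print(dfs[3]['det_vol'].iloc[0])  # 6
```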
Additionally, if your frames have the same structure and can be merged into a single frame, consider concatenating them first and applying your function to the result. That would be the more idiomatic pandas approach.
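For instance, a sketch with hypothetical small frames: concatenate once, then run the grouping function in a single call.

```python
import pandas as pd

def VDS_pre(df):
    df = df.groupby(['datestamp', 'timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time',
                            'det_vol': 'VolumeVDS'})
    return df[['Date', 'Time', 'VolumeVDS']]

# Hypothetical frames sharing the same structure
vds = pd.DataFrame({'datestamp': ['d1', 'd1'], 'timestamp': ['t1', 't1'], 'det_vol': [1, 2]})
vds2 = pd.DataFrame({'datestamp': ['d1'], 'timestamp': ['t1'], 'det_vol': [5]})

# Concatenate once, process once; rows with the same Date/Time are summed together
combined = VDS_pre(pd.concat([vds, vds2], ignore_index=True))
print(combined['VolumeVDS'].iloc[0])  # 8
```

Note this merges the sums across frames; if the four results must stay separate, the per-frame approaches above are the right tool.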
Related
I have a process which I am able to loop through for values held in a list, but it overwrites the final dataframe with each loop, and I would like to append or concat the result of the loops into one dataframe.
For example given below I can see 'dataframe' will populate initially with result of 'blah1', then when process finishes it has the result of 'blah2'
listtoloop = ['blah1', 'blah2']
for name in listtoloop:
    # some process happens here resulting in
    # dataframe = result of above process
The typical pattern used for this is to create a list of DataFrames, and only at the end of the loop, concatenate them into a single DataFrame. This is usually much faster than appending new rows to the DataFrame after each step, as you are not constructing a new DataFrame on every iteration.
Something like this should work:
listtoloop = ['blah1', 'blah2']
dfs = []
for name in listtoloop:
    # some process happens here resulting in
    # dataframe = result of above process
    dfs.append(dataframe)
final = pd.concat(dfs, ignore_index=True)
Put your results in a list, then append the list as a row to the df, making sure the list is in the same order as the df's columns:
listtoloop = ['blah1', 'blah2']
df = pd.DataFrame(columns=["A", "B"])
for name in listtoloop:
    ## processes here
    to_append = [5, 6]
    df_length = len(df)
    df.loc[df_length] = to_append
data_you_need = pd.DataFrame()
listtoloop = ['blah1', 'blah2']
for name in listtoloop:
    ## some process happens here resulting in
    ## dataframe = result of above process
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    data_you_need = pd.concat([data_you_need, dataframe], ignore_index=True)
I am attempting to create four new pandas dataframes via a list comprehension. Each new dataframe should be the original 'constituents_list' dataframe with two new columns. These two columns add a defined number of years to an existing column and return the value. The example code is below
def add_maturity(df, tenor):
    df['tenor'] = str(tenor) + 'Y'
    df['maturity'] = df['effectivedate'] + pd.DateOffset(years=tenor)
    return df
year_list = [3, 5, 7, 10]
new_dfs = [add_maturity(constituents_file, tenor) for tenor in year_list]
My expected output in the new_dfs list should have four dataframes, each with a different value for 'tenor' and 'maturity'. In my results, all four dataframes have the same data with 'tenor' of '10Y' and a 'maturity' that is 10 years greater than the 'effectivedate' column.
I suspect that each time I iterate through the list comprehension each existing dataframe is overwritten with the latest call to the function. I just can't work out how to stop this happening.
Many thanks
When you're assigning to the DataFrame object, you're modifying in place. And when you pass it as an argument to a function, what you're passing is a reference to the DataFrame object, in this case a reference to the same DataFrame object every time, so that's overwriting the previous results.
To solve this issue, you can either create a copy of the DataFrame at the start of the function:
def add_maturity(df, tenor):
    df = df.copy()
    df['tenor'] = str(tenor) + 'Y'
    df['maturity'] = df['effectivedate'] + pd.DateOffset(years=tenor)
    return df
(Or you could keep the function as is, and have the caller copy the DataFrame first when passing it as an argument...)
Or you can use the assign() method, which returns a new DataFrame with the modified columns:
def add_maturity(df, tenor):
    return df.assign(
        tenor=str(tenor) + 'Y',
        maturity=df['effectivedate'] + pd.DateOffset(years=tenor),
    )
(Personally, I'd go with the latter. It's similar to how most DataFrame methods work, in that they typically return a new DataFrame rather than modifying it in place.)
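With either copying version, the original list comprehension now yields four distinct frames. A sketch with a hypothetical one-row constituents frame:

```python
import pandas as pd

def add_maturity(df, tenor):
    return df.assign(
        tenor=str(tenor) + 'Y',
        maturity=df['effectivedate'] + pd.DateOffset(years=tenor),
    )

# Hypothetical stand-in for the constituents frame
constituents_file = pd.DataFrame({'effectivedate': pd.to_datetime(['2020-01-01'])})

year_list = [3, 5, 7, 10]
new_dfs = [add_maturity(constituents_file, tenor) for tenor in year_list]

print([df['tenor'].iloc[0] for df in new_dfs])  # ['3Y', '5Y', '7Y', '10Y']
```

The original frame is left untouched, and each element of new_dfs carries its own tenor and maturity.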
Apologies if this question appears to be a duplicate of other questions, but I could not find an answer that addresses my problem exactly.
I split a dataframe, called "data", into multiple subsets that are stored in a dictionary of dataframes named "dfs" as follows:
# Partition DF
dfs = {}
chunk = 5
for n in range(data.shape[0] // chunk + 1):
    df_temp = data.iloc[n*chunk:(n+1)*chunk]
    df_temp = df_temp.reset_index(drop=True)
    dfs[n] = df_temp
Now, I would like to apply a pre-defined helper function called "fun_c" to EACH of the dataframes (that are stored in the dictionary object called "dfs").
Is it correct for me to apply the function to the dfs in one go, as follows(?):
result = fun_c(dfs)
If not, what would be the correct way of doing this?
It depends on the output you're looking for:
If you want a dict in the output, then you should apply the function to each dict item
result = {key: fun_c(val) for key, val in dfs.items()}
If you want a list of dataframes/values in the output, then apply the function to each dict value
result = [fun_c(val) for val in dfs.values()]
But this style isn't wrong either; you can iterate however you like inside the helper function as well:
def fun_c(dfs):
    result = None
    # either
    for key, val in dfs.items():
        pass
    # or
    for val in dfs.values():
        pass
    return result
Let me know if this helps!
Since you want this:
Now, I would like to apply a pre-defined helper function called
"fun_c" to EACH of the dataframes (that are stored in the dictionary
object called "dfs").
Let's say your dataframe dict looks like this and your helper function takes in a single dataframe.
dfs = {0 : df0, 1: df1, 2: df2, 3:df3}
Let's iterate through the dictionary, apply the fun_c function on each of the dataframes, and save the results in another dictionary having the same keys:
dfs_result = {k: fun_c(v) for k, v in dfs.items()}
I'm trying to split an array in a data-frame column and append the individual entries to a new data frame.
I managed to write a function that seems to be able to iterate over the individual entries. But when I try to append them to another data frame, the data frame stays empty.
Can I even edit a data frame from within a function?
import pandas as pd

# Original data frame
series1 = pd.Series([['cat', 'dog', 'rabbit'], ['frog', 'moose', 'fly']])
oldDF = pd.DataFrame(series1)

# New data frame where I want to populate all values in the old
series2 = pd.Series([])
newDF = pd.DataFrame(series2)

# Define function to iterate over each array
def appendItems(x, df):
    for item in x:
        for i in item:
            # Trying to append entries to new dataframe
            df.append(pd.Series([i]), ignore_index=True)
            print(pd.Series([i]))

# Apply above function to dataframe
oldDF.apply(appendItems, args=[newDF])
# Result -> empty data frame :-(
print("Checking result")
newDF.head()
The problem you are facing with your appendItems function is that df.append() returns a new copy and does not modify df in place.
df.append() uses pd.concat() under the hood (and it was deprecated and then removed entirely in pandas 2.0).
If you really want to use your appendItems function, use df.loc[] to modify df directly rather than a copy.
Here is an example:
import numpy as np

def appendItems(x, df):
    for i, item in enumerate(np.hstack(x.values.tolist())):
        df.loc[i, 0] = item
(np.hstack is just used to flatten the nested list of values)
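A self-contained usage sketch with the frames from the question (numpy is imported directly, since the old pd.np alias no longer exists in modern pandas):

```python
import numpy as np
import pandas as pd

def appendItems(x, df):
    # flatten the nested lists, then write each entry into df in place via .loc
    for i, item in enumerate(np.hstack(x.values.tolist())):
        df.loc[i, 0] = item

oldDF = pd.DataFrame(pd.Series([['cat', 'dog', 'rabbit'], ['frog', 'moose', 'fly']]))
newDF = pd.DataFrame()

# apply calls appendItems once per column of oldDF (here, just one column)
oldDF.apply(appendItems, args=[newDF])
print(newDF[0].tolist())  # ['cat', 'dog', 'rabbit', 'frog', 'moose', 'fly']
```

Because .loc assigns into newDF itself instead of returning a copy, the caller sees the populated frame after apply finishes.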
links:
pd.DataFrame.append
pd.concat
np.hstack
You could try using numpy.concatenate
import numpy as np
pd.DataFrame(np.concatenate(oldDF[0]))
[output]
0
0 cat
1 dog
2 rabbit
3 frog
4 moose
5 fly
I am a newbie to python. I am trying to iterate over the rows of individual columns of a dataframe in python. I am trying to create an adjacency list using the first two columns of a dataframe taken from csv data (which has 3 columns).
The following is the code to iterate over the dataframe and create a dictionary for adjacency list:
df1 = pd.read_csv('person_knows_person_0_0_sample.csv', sep=',', index_col=False, skiprows=1)
src_list = list(df1.iloc[:, 0:1])
tgt_list = list(df1.iloc[:, 1:2])
adj_list = {}
for src in src_list:
    for tgt in tgt_list:
        adj_list[src] = tgt
print(src_list)
print(tgt_list)
print(adj_list)
and the following is the output I am getting:
['933']
['4139']
{'933': '4139'}
I see that I am not getting the entire list when I use the list() constructor.
Hence I am not able to loop over the entire data.
Could anyone tell me where I am going wrong?
To summarize, Here is the input data:
A,B,C
933,4139,20100313073721718
933,6597069777240,20100920094243187
933,10995116284808,20110102064341955
933,32985348833579,20120907011130195
933,32985348838375,20120717080449463
1129,1242,20100202163844119
1129,2199023262543,20100331220757321
1129,6597069771886,20100724111548162
1129,6597069776731,20100804033836982
the output that I am expecting:
933: [4139,6597069777240, 10995116284808, 32985348833579, 32985348838375]
1129: [1242, 2199023262543, 6597069771886, 6597069776731]
Use groupby to create a Series of lists, then convert it with to_dict():
# selecting columns by name
d = df1.groupby('A')['B'].apply(list).to_dict()

# selecting columns by position
d = df1.iloc[:, 1].groupby(df1.iloc[:, 0]).apply(list).to_dict()
print (d)
{933: [4139, 6597069777240, 10995116284808, 32985348833579, 32985348838375],
1129: [1242, 2199023262543, 6597069771886, 6597069776731]}