Iterating over dataframes and adding items from a list - python

Quite new to Python for data analysis, still a noob.
I have a list of pandas DataFrames (100+) whose variables are saved into a list.
I then have the variable names saved as strings in another list, to add into the DataFrames as an identifier when plotting.
I have defined a function to prepare the tables for later feature engineering.
I want to iterate through each DataFrame and add the corresponding string into a column called "Strings".
df = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']

def mindex(df):
    # remove time index and insert Strings column
    df.reset_index(inplace=True)
    df.insert(1, "Strings", "")
    # iterate through each table adding the string values
    for item in enumerate(df):
        for item2 in strings:
            df['Strings'] = item2

# the loop to cycle through all the dataframes using the function above
for i in df:
    mindex(i)
Whenever I use the function above, it only fills the last value into all of the dataframes. I would like to note that all the dataframes cover the same date range, as I have tried to use this as a way to stop the iteration, with no luck.
Can anyone point me in the right direction? Google has not been my friend so far.

df = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']

for s, d in zip(strings, df):
    d['Strings'] = s
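If you also want the reset_index and Strings-column steps from your mindex function, a minimal runnable sketch (toy frames stand in for your real DataFrames, which are assumed to have a date index) could look like this:

import pandas as pd

# toy stand-ins for the real DataFrames
df1 = pd.DataFrame({"value": [1, 2]}, index=pd.date_range("2021-01-01", periods=2))
df2 = pd.DataFrame({"value": [3, 4]}, index=pd.date_range("2021-01-01", periods=2))
df3 = pd.DataFrame({"value": [5, 6]}, index=pd.date_range("2021-01-01", periods=2))

dfs = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']

for name, d in zip(strings, dfs):
    d.reset_index(inplace=True)   # move the date index back into a column
    d.insert(1, "Strings", name)  # add the identifier as a whole column

print(df1.head())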

In the line df['Strings'] = item2 you assign the variable item2 to the entire column df['Strings'].
So the first iteration assigns 'df1', the second assigns 'df2', and the last assigns 'df3', which is what you end up seeing.
If you want the Strings column to be entirely populated with 'df1' for df1, 'df2' for df2, etc., you have to:
def mindex(dfs: list, strings: list) -> list:
    final_dfs = []
    for single_df, df_name in zip(dfs, strings):
        single_df = single_df.copy()
        single_df.reset_index(inplace=True)
        single_df.insert(1, "Strings", "")
        single_df['Strings'] = df_name
        final_dfs.append(single_df)
    return final_dfs

dfs = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']
result = mindex(dfs, strings)
A few takeaways:
If you define a list of DataFrames, name it dfs (plural), not df:
dfs = [df1, df2, df3]
If you iterate through a pandas DataFrame's rows, use df.iterrows(). It generates indices and rows, so you don't need to apply enumerate (a short runnable sketch follows after these takeaways).
for idx, row in df.iterrows():
    ....
If a variable in a for loop is not going to be used, like item in your example, use an underscore instead. It is good practice for an unused variable:
for _ in enumerate(df):
    for item2 in strings:
        df['Strings'] = item2
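For the df.iterrows() takeaway, a tiny runnable sketch (toy frame made up for illustration):

import pandas as pd

df = pd.DataFrame({"a": [10, 20], "b": [30, 40]})

for idx, row in df.iterrows():
    # idx is the index label, row is a Series holding that row's values
    print(idx, row["a"], row["b"])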

Related

Create a new column in multiple dataframes using for loop

I have multiple dataframes with the same structure but different values,
for instance
df0, df1, df2...., df9
To each dataframe I want to add a column named eventdate that consists of one date, for instance 2019-09-15, using a for loop:
for i in range(0, 9):
    df+str(i)['eventdate'] = "2021-09-15"
but I get an error message
SyntaxError: cannot assign to operator
I think it's because df isn't defined. This should be very simple... Any idea how to do this? Thanks.
dfs = [df0, df1, df2...., df9]
dfs_new = []
for i, df in enumerate(dfs):
    df['eventdate'] = "2021-09-15"
    dfs_new.append(df)
If you can't generate a list, then you could use eval(f"df{str(num)}"), but this method isn't recommended from what I've seen.
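A common alternative to eval, sketched here under the assumption that you can build the collection yourself, is to keep the frames in a dict keyed by name:

import pandas as pd

# toy frames standing in for df0..df9 (assumed to share the same structure)
frames = {f"df{i}": pd.DataFrame({"value": [i]}) for i in range(10)}

for name, frame in frames.items():
    frame["eventdate"] = "2021-09-15"

print(frames["df3"])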

Loop over a list in python and create a dataframe

I have a list "profiles"; each element of the list looks like this (screenshot omitted).
I want to create a dataframe out of this list, so that every account is in the dataframe.
When I create one dataframe for every element of the list manually and then append them all to one dataframe it works. However, I wanted to do a loop because I have about 40 elements in that list.
The code looks as follows (doing it manually):
df2 = pd.DataFrame(profiles[1])
df1 = df1.append(df2)
df3 = pd.DataFrame(profiles[2])
df1 = df1.append(df3)
... and so on
My loop looks like this:
for y in range(len(profiles)):
    if x != y:
        df1 = pd.DataFrame(profiles[x])
        df2 = pd.DataFrame(profiles[y])
        df1.append(df2)
Does anybody know how to make the loop work? Because it does not append the second dataframe.
You are overwriting your df1 in each iteration of the loop.
IIUC, in the absence of reproducible data and an example, this is what you should add to your code.
Outside of the for loop, create an empty dataframe:
df_final = pd.DataFrame()
Inside the loop:
df_final = pd.concat([df_final, df1], axis=0)
If your df1 and df2 both get initialized in each iteration, you can add them both to df_final together as follows:
df_final = pd.concat([df_final, df1, df2], axis=0)
At the end of the loop, df_final will have all the dfs appended to it.
append is not an in-place operation, so you have to assign the result of the call (otherwise it gets lost).
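Putting that together, a minimal runnable sketch (assuming profiles is a list of record-like elements; pd.concat is used since DataFrame.append is deprecated in recent pandas):

import pandas as pd

# toy stand-in for the real "profiles" list
profiles = [
    [{"account": "a1", "followers": 10}],
    [{"account": "a2", "followers": 20}],
    [{"account": "a3", "followers": 30}],
]

frames = [pd.DataFrame(p) for p in profiles]   # one DataFrame per element
df_all = pd.concat(frames, ignore_index=True)  # stack them row-wise
print(df_all)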

Apply changes to dataframes in a dict thanks to a for loop: how to do it?

I can't apply the alterations I make to dataframes inside a dictionary. The changes are done with a for loop.
The problem is that although the loop works (the changes show up on the single iterated df), they do not apply to the dataframes in the dictionary.
The end goal is to create a merge of all the dataframes, since they come from different Excel files and sheets.
Here is the code:
Import the two Excel files, assigning None to the sheet_name parameter in order to import all the sheets of each document into a dict. I have 8 sheets in the EEG Excel file and 5 in the SC file.
import numpy as np
import pandas as pd
eeg = pd.read_excel("path_file", sheet_name=None)
sc = pd.read_excel("path_file", sheet_name=None)
Merge the first dictionary with the second one using the update method. Now the eeg dict contains both EEG and SC.
So now I have a dict with 13 dfs inside:
eeg.update(sc)
The for loop is needed in order to carry out some modifications on each single df:
reset the index to a specific column (common to all dfs), rename it, add a prefix corresponding to the df's key to its variables, and lastly replace 0 with NaN.
for key, df in eeg.items():
    df.set_index('Unnamed: 0', inplace=True)
    df.index.rename('Sbj', inplace=True)
    df = df.add_prefix(key + '_')
    df.replace(0, np.nan, inplace=True)
Although the loop runs on the dictionary items and the single iterated dataframe works, I don't see the changes on the dictionary dfs and therefore can't proceed to extract them into a list and then merge.
Inspecting the single df inside the for loop, it looks right, but when I go back to the dfs in the dict, they are still as before.
You need to map your modified dataframe back into your dictionary. The in-place calls mutate the dataframes stored in the dict, but add_prefix returns a new DataFrame that is only bound to the loop variable df, so it never reaches the dict:
for key, df in eeg.items():
    df.set_index('Unnamed: 0', inplace=True)
    df.index.rename('Sbj', inplace=True)
    df = df.add_prefix(key + '_')
    df.replace(0, np.nan, inplace=True)
    eeg[key] = df  # map df back into eeg
What you probably want is:
# merge the dataframes in your dictionary into one
df1 = pd.DataFrame()
for key, df in eeg.items():
    df1 = pd.concat([df1, df])

# apply index changes to the merged dataframe
df1.set_index('Unnamed: 0', inplace=True)
df1.index.rename('Sbj', inplace=True)
df1 = df1.add_prefix(key + '_')
df1.replace(0, np.nan, inplace=True)
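As to why the original loop didn't stick, a minimal sketch with a toy dict (names made up for illustration):

import pandas as pd

toy = {"a": pd.DataFrame({"x": [1, 2]})}

for key, df in toy.items():
    # add_prefix returns a NEW DataFrame; only the loop variable is rebound,
    # the object stored in the dict is untouched
    df = df.add_prefix(key + "_")

print(toy["a"].columns.tolist())    # ['x']  -- prefix not applied

for key, df in toy.items():
    toy[key] = df.add_prefix(key + "_")   # write the new object back

print(toy["a"].columns.tolist())    # ['a_x']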

Why does passing a dataframe to list.extend() result in only column names stored in the list?

I need lists of column names for several dataframes stored in a dictionary. It turns out that I accidentally got the desired result, but I thought that the code would work differently. Could someone explain why this code works?
Initial idea: loop through dictionary keys, add values (dataframes) to target lists --> get lists of dataframes --> [somehow] extract column names from dataframes.
What worked: loop through dictionary keys, add values (dataframes) to target lists --> get lists of dataframes column names, nothing further required.
list1 = []
list2 = []
list3 = []
for key in dfDict.keys():
    # each dfDict key has a value tuple of 3 dataframes --> key: (df1, df2, df3)
    list1.extend(dfDict[key][0])  # for df1
    list2.extend(dfDict[key][1])  # for df2
    list3.extend(dfDict[key][2])  # for df3
Expected:
list1 = [df1]
list2 = [df2]
list3 = [df3]
Actual:
list1 = [df1.columns]
list2 = [df2.columns]
list3 = [df3.columns]
It's awesome, but why?
list.extend iterates over its argument, and DataFrame.__iter__ iterates over the dataframe's column names. There is not much more to it.
df = pd.DataFrame([], columns=['a', 'b'])
print([col_name for col_name in df])
Outputs
['a', 'b']
This is somewhat analogous to dict.__iter__ iterating over the keys.
df[col] for col in df
behaves "the same" as
dict[key] for key in dict
Either way, in your use case you should use append (and as shown below, you don't have to explicitly use .keys):
for key in dfDict:
    list1.append(dfDict[key][0])
    list2.append(dfDict[key][1])
    list3.append(dfDict[key][2])
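A small self-contained demonstration of the difference (toy frame made up for illustration):

import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

by_extend = []
by_extend.extend(df)   # iterates the DataFrame -> stores the column names
print(by_extend)       # ['a', 'b']

by_append = []
by_append.append(df)   # stores the DataFrame object itself
print(len(by_append), type(by_append[0]))  # 1 <class 'pandas.core.frame.DataFrame'>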

Iteratively concatenate pandas dataframe with multiindex

I am iteratively processing a couple of "groups" and I would like to add them together into one dataframe, with every group identified by a 2nd-level index.
This:
print pd.concat([df1, df2, df3], keys=["A", "B", "C"])
was suggested to me - but it doesn't play well with iteration.
I am currently doing:
data_all = pd.DataFrame([])
for a in a_list:
    group = some.function(a, etc)
    group = group.set_index(['CoI'], append=True, drop=True)
    group = group.reorder_levels(['CoI', 'oldindex'])
    data_all = pd.concat([data_all, group], ignore_index=False)
But the last line totally destroys my multi-index and I cannot reconstruct it.
Can you give me a hand?
You should be able to just make data_all a list and concatenate once at the end:
data_all = []
for a in a_list:
    group = some.function(a, etc)
    group = group.set_index(['CoI'], append=True, drop=True)
    group = group.reorder_levels(['CoI', 'oldindex'])
    data_all.append(group)
data_all = pd.concat(data_all, ignore_index=False)
Also keep in mind that pandas' concat works with iterators. Something like yield group may be more efficient than appending to a list each time. I haven't profiled it though!
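A sketch of that generator variant (make_group is a made-up stand-in for some.function from the question):

import pandas as pd

# toy stand-in for some.function: returns a small frame with a 'CoI' column
def make_group(a):
    return pd.DataFrame({"CoI": [a, a], "value": [1, 2]})

def iter_groups(a_list):
    # yield one prepared group at a time instead of collecting them in a list
    for a in a_list:
        group = make_group(a)
        group = group.set_index(['CoI'], append=True, drop=True)
        group = group.reorder_levels([1, 0])  # put 'CoI' first, old index second
        yield group

data_all = pd.concat(iter_groups(["A", "B", "C"]), ignore_index=False)
print(data_all)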
