Creating lists of dataframes to iterate over those dataframes - python

In Python, I'm reading an Excel file with multiple sheets, to create each sheet as its own dataframe. I then want to create a list of those dataframes' names so I can use it in a list comprehension.
The code below lets me create all the sheets into their own dataframe, and because some of the sheet names have special characters, I use regex to replace them with an underscore:
df = pd.read_excel('Book1.xlsx', sheet_name=None)
df = {re.sub(r"[-+\s]", "_", k): v for k,v in df.items()}
for key in df.keys():
    globals()[key] = df[key]
Then, I can get a list of the dataframes with the IPython magic: all_dfs = %who_ls DataFrame
giving me: ['GB_SF_NZ', 'GF_1', 'H_2_S_Z']
When I try the following code:
for df in all_dfs:
    df.rename(columns=df.iloc[2], inplace=True)
I get the error: AttributeError: 'str' object has no attribute 'rename'
But, if I manually type out the names of the dataframes into a list (note here that they are not in quotation marks):
manual_dfs = [GB_SF_NZ, GF_1, H_2_S_Z]
for df in manual_dfs:
    df.rename(columns=df.iloc[2], inplace=True)
I don't get an error and the code works as expected.
Can anyone help me with why one works (without quotation marks) but the other (extracted, with quotation marks) doesn't, and how I can create a list of dataframes without the quote marks?

Following your code, you can do:
for name in all_dfs:
    df = globals()[name]
    df.rename(columns=df.iloc[2], inplace=True)
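The key point: %who_ls returns variable *names* as strings, not the DataFrame objects themselves, so each name has to be resolved through globals() before calling DataFrame methods on it. A minimal runnable sketch, with a made-up frame standing in for one of the sheets:

```python
import pandas as pd

# Hypothetical stand-in for one sheet loaded from Book1.xlsx
GB_SF_NZ = pd.DataFrame([["junk", "junk"], ["junk", "junk"], ["colA", "colB"], [1, 2]])
all_dfs = ["GB_SF_NZ"]  # what %who_ls DataFrame yields: names as strings

for name in all_dfs:
    frame = globals()[name]  # resolve the name to the actual DataFrame
    frame.rename(columns=frame.iloc[2], inplace=True)

print(GB_SF_NZ.columns.tolist())  # ['colA', 'colB']
```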

Related

Remove special characters from dataframe names

In Python, I'm reading an Excel file with multiple sheets, with the intention of each sheet being its own dataframe:
df = pd.read_excel('Book1.xlsx', sheet_name=None)
So to get the dictionary keys to each dataframe (or sheet) I can use: df.keys() which gives me each sheet name from the original Excel file: dict_keys(['GF-1', 'H_2 S-Z', 'GB-SF+NZ'])
I can then assign each dictionary into its own dataframe using:
for key in df.keys():
    globals()[key] = df[key]
But, because the sheet names from the original Excel file contain special characters ( -, spaces, + etc), I can't call up any of the dataframes individually:
H_2 S-Z.head()
^
SyntaxError: invalid syntax
I know that dataframe 'names' cannot contain special characters or start with numbers etc, so how do I remove those special characters?
I don't think the dict_keys can be edited (e.g. using regex). I also thought about creating a list of the dataframes and then running a regex for loop over each dataframe name, but I'm not sure that it would assign the 'new' dataframe name back to each dataframe.
Can anyone help me?
You can use re.sub with a dict comprehension to get rid of the characters (-, +, whitespace, ...):
import re
dict_dfs = pd.read_excel("Book1.xlsx", sheet_name=None)
dict_dfs = {re.sub(r"[-+\s]", "_", k): v for k,v in dict_dfs.items()}
for key in dict_dfs.keys():
    globals()[key] = dict_dfs[key]
As suggested by @cottontail, you can also use re.sub(r"\W", "_", k).
NB: As a result (in the global scope), you'll have as many variables (pandas.core.frame.DataFrame objects) as there are worksheets in your Excel file.
print([(var, type(val)) for var, val in globals().items()
       if type(val) == pd.core.frame.DataFrame])
#[('GF_1', pandas.core.frame.DataFrame),
# ('H_2_S_Z', pandas.core.frame.DataFrame),
# ('GB_SF_NZ', pandas.core.frame.DataFrame)]
globals() is already a dictionary (you can confirm by isinstance(globals(), dict)), so the individual sheets can be accessed as any dict value:
globals()['H_2 S-Z'].head()
etc.
That being said, instead of creating individually named dataframes, I would think that storing the sheets as dataframes in a single dictionary may be more readable and accessible for you down the road. It's already creating problems given you cannot name your dataframes with the same name as the sheet names. If you change the dataframe names, then you'll need another mapping that tells you which sheet name corresponds to which dataframe name, so it's a lot of work tbh. As you already have a dictionary of dataframes in df, why not access the individual sheets by df['H_2 S-Z'] etc.?
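To illustrate the single-dictionary approach, here is a sketch with made-up sheet data in place of a real workbook (the keys mirror the sheet names from the question):

```python
import pandas as pd

# Stand-in for df = pd.read_excel('Book1.xlsx', sheet_name=None)
df = {
    "GF-1": pd.DataFrame({"a": [1, 2]}),
    "H_2 S-Z": pd.DataFrame({"b": [3, 4]}),
}

# No sanitizing or globals() needed: the awkward sheet names are just dict keys
print(df["H_2 S-Z"].head())
```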

Problems manipulating dataframes in a dictionary from pd.read_excel

Good day all,
I have an Excel file with multiple sheets, which I load via pd.read_excel into a dict object.
Each excel sheet contains some column names which are the same:
Like sheet1: ['IDx','IDy'], sheet2: ['IDx','IDy']
I usually use pd.concat in order to load everything into one data frame. This way same column names also get merged correctly.
But this way is not robust when there is some accidental extra white space in one of the ID columns that are actually meant to be merged by pd.concat,
like: sheet1: ['IDx','IDy'], sheet2: ['IDx',' IDy']
To tackle this problem I thought to iterate over each data frame from the excel dict and strip() the column names of white space and then pd.concat afterwards.
excel = pd.read_excel(r'D:\ExcelFile.xlsx', sheet_name=None)
new_excel = {}
for name, sheet in excel.items():
    sheet = sheet.columns.str.strip()
    #print(sheet)
    new_excel[name] = sheet
print(new_excel)
output:
{'Sheet1': Index(['IDx', 'IDy', ...])}
at this point I am stuck. I can't do anything with the new_excel dict. It seems that I am accessing each data frame incorrectly and just get the Index object. I can't get my head around this issue.
When trying to concat with new_excel:
TypeError: cannot concatenate object of type '<class 'pandas.core.indexes.base.Index'>'; only Series and DataFrame objs are valid
Many thanks in advance!
does this work?
new_excel = {}
for name, sheet in excel.items():
    new_columns = [str(col).strip() for col in sheet.columns]
    sheet.columns = new_columns
    new_excel[name] = sheet
No, sadly not. I have integers in my column names (AttributeError: 'int' object has no attribute 'strip'). When trying to change the type of the columns I ran into trouble with a key error and the indexes being messed up. I tried to circumvent this with another for loop, but have failed so far to get it to work. I am not very good with this:
for name, sheet in excel.items():
    for i in sheet:
        if type(i) != int:
            new_columns = [col.strip() for col in sheet.columns]
        else:
            new_columns = sheet.columns
        print(new_columns)
        sheet.columns = new_columns
    new_excel[name] = sheet
I'm still getting AttributeError: 'int' object has no attribute 'strip'. The error is clear, but the solution isn't, for me.
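One way around the AttributeError, sketched under the assumption that excel is the dict returned by pd.read_excel(..., sheet_name=None): strip only the column labels that are strings and leave integer labels untouched.

```python
import pandas as pd

# Stand-in for excel = pd.read_excel(..., sheet_name=None), with a mixed-type header
excel = {"Sheet1": pd.DataFrame({" IDx": [1], "IDy ": [2], 2021: [3]})}

new_excel = {
    # rename accepts a callable, applied to each column label individually
    name: sheet.rename(columns=lambda c: c.strip() if isinstance(c, str) else c)
    for name, sheet in excel.items()
}

print(new_excel["Sheet1"].columns.tolist())  # ['IDx', 'IDy', 2021]
```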

Pandas. How to call data frame from the list of data frames?

I have a list of data frames:
all_df = ['df_0','df_1','df_2','df_3','df_4','df_5','df_6']
How can I call them from this list to do something like that:
for (df, names) in zip(all_df, names):
    df.to_csv('output/{}.csv'.format(names))
When executing it, I expectedly get the error 'str' object has no attribute 'to_csv', since I'm giving a string to to_csv.
How can I save several data-frames (or perform other actions on them) in the for loop?
Thanks!
Could you please also give an idea on how to create the 'right' list of data frames from this:
path = 'inp/multysheet_excel_01.xlsx'
xl = pd.ExcelFile(path)
sh_name = xl.sheet_names
n = 0
for i in sh_name:
    exec('df_{} = pd.read_excel("{}", sheet_name="{}")'.format(n, path, i))
    n += 1
so basically I'm trying to get the each sheet of an input excel as a separate dataframe, perform some actions on them, and save each output dataframe in separate excels.
You're quite close, but I see some mistakes in that for loop. Say you have a list of dataframes dfs, and its corresponding names as a list of strings names, you can save those dataframes using the names as:
dfs = [df_1, df_2, df_3]
names = ['df_1', 'df_2', 'df_3']
for df, name in zip(dfs, names):
    df.to_csv('output\\{}.csv'.format(name))
Though if you only had a list of names, you could also do something like:
names = ['df_0', 'df_1', 'df_2']
for name in names:
    globals()[name].to_csv('output\\{}.csv'.format(name))
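As for building the 'right' collection in the first place: pd.read_excel with sheet_name=None already returns a dict mapping sheet names to DataFrames, so the exec loop can be skipped entirely. A sketch with stand-in frames (and a temporary directory in place of the output folder):

```python
import os
import tempfile

import pandas as pd

# Stand-in for dfs = pd.read_excel('inp/multysheet_excel_01.xlsx', sheet_name=None)
dfs = {"Sheet1": pd.DataFrame({"a": [1]}), "Sheet2": pd.DataFrame({"a": [2]})}

outdir = tempfile.mkdtemp()
for name, df in dfs.items():
    # ...perform some actions on df here...
    df.to_csv(os.path.join(outdir, "{}.csv".format(name)))

print(sorted(os.listdir(outdir)))  # ['Sheet1.csv', 'Sheet2.csv']
```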

How to get the columns of a pandas dataframe as a list

I have a pandas dataframe, and when I try to access its columns (like df[["a"]]) it is not possible because
the columns are defined as an "Index" object (pandas.core.indexes.base.Index), i.e. Index(['col1', 'col2'], dtype='object').
I tried convert it doing something like df.columns = df.columns.tolist() and also df.columns = [str(col) for col in df.columns]
but the columns remained as an Index object.
What I want is to make df.columns and it would return a list object.
What Can I do ?
columns is not callable, so you need to remove the parentheses ():
df.columns will give you the name of the columns as an object.
list(df.columns) will give you the name of the columns as a list.
In your example, list(ss.columns) will return a list of column names.
try this:
df.columns.values.tolist()
since you were trying to convert it using this approach, you missed the values attribute
You have to wrap it in the list constructor to use it like a list, i.e. list(ss.columns).
list(ss.columns)
Hope this works!
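All of the suggestions above amount to the same conversion; a quick check on a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

print(type(df.columns))            # a pandas Index object, not a list
print(list(df.columns))            # ['a', 'b']
print(df.columns.tolist())         # ['a', 'b']
print(df.columns.values.tolist())  # ['a', 'b']
```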

Pandas - Convert dictionary to dataframe - keys as columns

I have a folder with .csv files that contain timeseries in the following format:
1 0.950861
2 2.34248
3 2.56038
4 3.46226
...
I access these textfiles by looping over the folder containing the files and passing each textfile to a dictionary:
data_dict = {textfile: pd.read_csv(textfile, header=3, delim_whitespace=True, index_col=0) for textfile in textfiles}
I want to merge the columns, which contain the data next to each other with the dictionary keys as index (Pathname of the textfiles). They all have the same row number.
So far I tried passing the dictionary to a pd.Dataframe like this:
df = pd.DataFrame.from_dict(data_dict, orient='index')
Actually, the orientation needs to be the default 'columns', but that results in an error:
ValueError: If using all scalar values, you must pass an index
If I do so, I get the wrong result (see the linked screenshot of the Excel output).
This is how I pass the frame to Excel:
writer = pd.ExcelWriter("output.xls")
df.to_excel(writer,'data', index_label = 'data', merge_cells =False)
writer.save()
I think the error must be in passing the dictionary to the dataframe.
I tried pd.concat/merge/append but nothing returns the right result.
Thanks in Advance!
IIUC you can try list comprehension with concat:
data_list = [pd.read_csv(textfile, header=3, delim_whitespace=True, index_col=0)
             for textfile in textfiles]
print(pd.concat(data_list, axis=1))
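If keeping track of which file each column came from matters, note that pd.concat also accepts the dictionary directly, using its keys as the outer level of the resulting column index (a sketch with stand-in frames in place of the CSV files):

```python
import pandas as pd

# Stand-in for data_dict = {textfile: pd.read_csv(...) for textfile in textfiles}
data_dict = {
    "file_a.csv": pd.DataFrame({"val": [0.95, 2.34]}),
    "file_b.csv": pd.DataFrame({"val": [2.56, 3.46]}),
}

merged = pd.concat(data_dict, axis=1)
print(merged.columns.tolist())  # [('file_a.csv', 'val'), ('file_b.csv', 'val')]
```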
