Remove special characters from dataframe names - python

In Python, I'm reading an Excel file with multiple sheets, with the intention of each sheet becoming its own dataframe:
df = pd.read_excel('Book1.xlsx', sheet_name=None)
To get the dictionary keys for each dataframe (or sheet), I can use df.keys(), which gives me each sheet name from the original Excel file: dict_keys(['GF-1', 'H_2 S-Z', 'GB-SF+NZ'])
I can then assign each dictionary entry to its own dataframe variable using:
for key in df.keys():
    globals()[key] = df[key]
But because the sheet names from the original Excel file contain special characters (-, spaces, +, etc.), I can't call up any of the dataframes individually:
H_2 S-Z.head()
^
SyntaxError: invalid syntax
I know that dataframe 'names' cannot contain special characters or start with numbers, etc., so how do I remove those special characters?
I don't think the dict_keys can be edited (e.g. using regex). I also thought about creating a list of the dataframes and then using a regex in a for loop to iterate over each dataframe name, but I'm not sure that would assign the 'new' name back to each dataframe.
Can anyone help me?

You can use re.sub with a dict comprehension to get rid of the characters (-, +, whitespace, ..):
import re
dict_dfs = pd.read_excel("Book1.xlsx", sheet_name=None)
dict_dfs = {re.sub(r"[-+\s]", "_", k): v for k,v in dict_dfs.items()}
for key in dict_dfs.keys():
    globals()[key] = dict_dfs[key]
As suggested by @cottontail, you can also use re.sub(r"\W", "_", k).
NB: As a result (in the global scope), you'll have as many variables (pandas.core.frame.DataFrame objects) as there are worksheets in your Excel file.
print([(var, type(val)) for var, val in globals().items()
       if type(val) == pd.core.frame.DataFrame])
#[('GF_1', pandas.core.frame.DataFrame),
# ('H_2_S_Z', pandas.core.frame.DataFrame),
# ('GB_SF_NZ', pandas.core.frame.DataFrame)]
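If some sheet names might also start with a digit (which, as the question notes, is invalid for a variable name), you can extend the substitution with a small helper. A minimal sketch building on the \W idea; the helper name is just for illustration:

import re

def to_identifier(name):
    # Replace every non-word character with an underscore
    name = re.sub(r"\W", "_", name)
    # Identifiers cannot start with a digit, so prefix one if needed
    if name and name[0].isdigit():
        name = "_" + name
    return name

print(to_identifier("H_2 S-Z"))   # H_2_S_Z
print(to_identifier("2023+GF"))   # _2023_GF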

globals() is already a dictionary (you can confirm with isinstance(globals(), dict)), so the individual sheets can be accessed like any other dict value:
globals()['H_2 S-Z'].head()
etc.
That being said, instead of creating individually named dataframes, storing the sheets as dataframes in a single dictionary may be more readable and accessible for you down the road. It's already creating problems, given you cannot name your dataframes after the sheet names; and if you change the dataframe names, you'll need another mapping to tell you which sheet name corresponds to which dataframe name, so it's a lot of work tbh. As you already have a dictionary of dataframes in df, why not access the individual sheets by df['H_2 S-Z'] etc.?
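For example (a sketch using the same df dictionary from the question):

df = pd.read_excel('Book1.xlsx', sheet_name=None)

# Look up one sheet by its original name, special characters and all
df['H_2 S-Z'].head()

# Or walk every sheet without creating any global variables
for sheet_name, frame in df.items():
    print(sheet_name, frame.shape)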

Related

Creating lists of dataframes to iterate over those dataframes

In Python, I'm reading an Excel file with multiple sheets, to create each sheet as its own dataframe. I then want to create a list of those dataframe 'names' so I can use it in a list comprehension.
The code below lets me turn each sheet into its own dataframe, and because some of the sheet names have special characters, I use regex to replace them with an underscore:
import re

df = pd.read_excel('Book1.xlsx', sheet_name=None)
df = {re.sub(r"[-+\s]", "_", k): v for k, v in df.items()}
for key in df.keys():
    globals()[key] = df[key]
Then I can get a list of the dataframes with the IPython magic: all_dfs = %who_ls DataFrame,
giving me: ['GB_SF_NZ', 'GF_1', 'H_2_S_Z']
When I try the following code:
for df in all_dfs:
    df.rename(columns=df.iloc[2], inplace=True)
I get the error 'AttributeError: 'str' object has no attribute 'rename'
But, if I manually type out the names of the dataframes into a list (note here that they are not in quotation marks):
manual_dfs = [GB_SF_NZ, GF_1, H_2_S_Z]
for df in manual_dfs:
    df.rename(columns=df.iloc[2], inplace=True)
I don't get an error and the code works as expected.
Can anyone help me with why one works (without quotation marks) but the other (extracted, with quotation marks) doesn't, and
how I can create a list of dataframes without the quote marks?
%who_ls gives you the variable names as strings, not the DataFrame objects themselves, which is why you get 'str' object has no attribute 'rename'. Following your code, you can look each object up in globals():
for name in all_dfs:
    globals()[name].rename(columns=globals()[name].iloc[2], inplace=True)
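If you'd rather avoid globals() altogether, a dictionary of dataframes gives you the same loop without the string lookup (a sketch, assuming the same workbook):

import re
import pandas as pd

dfs = pd.read_excel('Book1.xlsx', sheet_name=None)
dfs = {re.sub(r"[-+\s]", "_", k): v for k, v in dfs.items()}

# Iterate over the DataFrame objects directly
for frame in dfs.values():
    frame.rename(columns=frame.iloc[2], inplace=True)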

Setting variables/dictionary keys in python for loop to load multiple dataframes in pandas

I'm having trouble iterating through a for loop in Python with a dictionary, as I'm not quite familiar with the syntax.
Here is an approximation of my code:
import pandas as pd
data = {'q1':'q1data.xlsx','q2':'q2data.xlsx','q3':'q3data.xlsx'}
for name,link in data:
    name=pd.read_excel(link,sheet_name=0,header=0)
What I am hoping for here is to end up with three dataframes, q1, q2, and q3. I'm either getting an error that there are too many values to unpack, or I end up with a single dataframe called 'name' from the last item in my dictionary.
Any ideas? Thanks!
You need to use .items() when iterating over dictionary key-value pairs. Without this, you iterate over the keys only, which means that you only have one value to unpack (one key at a time), as indicated by the error you mention.
data = {'q1':'q1data.xlsx','q2':'q2data.xlsx','q3':'q3data.xlsx'}
for name, link in data.items():
    ...
You can't do this with variables in Python. I suggest collecting the DataFrames in a list and then unpacking them into a known number of variables:
files = ['q1data.xlsx', 'q2data.xlsx', 'q3data.xlsx']
q1, q2, q3 = [pd.read_excel(link, sheet_name=0, header=0)
              for link in files]
Better: In order not to make assumptions about the number of DataFrames and to keep everything flexible, store the DataFrames in a dictionary:
data = {'q1':'q1data.xlsx','q2':'q2data.xlsx','q3':'q3data.xlsx'}
dataframes = {name: pd.read_excel(link, sheet_name=0, header=0)
              for name, link in data.items()}
# use like:
dataframes["q1"]

How to use a variable name as string in Python

In Python, I've created a bunch of dataframes like so:
df1 = pd.read_csv("1.csv")
...
df50 = pd.read_csv("50.csv") # import modes may vary based on the csv, no real way to shorten this
For every dataframe, I'd like to perform an operation which requires assigning a string as a name. For instance, given an existing database db,
df1.to_sql("df1", db) # and so on.
The dataframes may have non-sequential names, so I can't do for i in range(1,51): "df"+str(i).
I'm looking for the right way to do this, instead of repeating the line 50 times. My idea was something like
for df in [df1, df2... df50]:
    df.to_sql(df.__name__, db)  # but dataframes don't have a __name__
How do I get the string "df1" from the dataframe I've called df1?
Is there an even nicer way to do all this?
Since the name appears to have been created following a pattern in the first place, just use code to replicate that pattern:
for i, df in enumerate([df1, df2... df50], start=1):
    df.to_sql(f'df{i}', db)
(Better yet, don't have those variables in the first place; create the list directly.)
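For instance (a sketch, assuming the files really are named 1.csv through 50.csv):

# Build the list directly instead of 50 separate variables
dfs = [pd.read_csv(f"{i}.csv") for i in range(1, 51)]

for i, df in enumerate(dfs, start=1):
    df.to_sql(f"df{i}", db)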
The dataframes may have non-sequential names, so I can't do for i in range(1,51): "df"+str(i).
Oh. Well in that case, if you want to associate textual names that don't follow a pattern with the objects, that is what a dict is for:
dfs = {
    "df1": pd.read_csv("1.csv"),
    # whichever other names and values make sense
}
which you can iterate over easily:
for name, df in dfs.items():
    df.to_sql(name, db)
If there is a logical rule that relates the input filename to the one that should be used for the to_sql call, you can use a dict comprehension to build the dict:
dfs = {to_sql_name(csv_name): pd.read_csv(csv_name) for csv_name in ...}
Or do the loading and processing in the same loop:
for csv_name in ...:
    pd.read_csv(csv_name).to_sql(to_sql_name(csv_name), db)

Extract multiple dataframes from dictionary with Python

I'm using the pandas library in Python.
I've taken an Excel file and stored the contents in a data frame by doing the following:
path = r"filepath"
sheets_dict = pd.read_excel(path,sheet_name=None)
As there were multiple sheets, each containing a table of data with identical columns, I used pd.read_excel(path, sheet_name=None). This stored all the individual sheets in a dictionary, with the key for each value/sheet being the sheet name.
I now want to unpack the dictionary and place each sheet into a single dataframe. I want to use the key of each sheet in the dictionary either as part of a multiindex, so I know which sheet each table came from, or appended as a new column giving me the sheet name for each subset of the dataframe.
I've tried the following:
for k,df in sheets_dict.items():
    df = pd.concat([pd.DataFrame(df)])
    df['extract'] = k
However, I'm not getting the results I want.
Any suggestions?
You can use the keys argument in pd.concat, which will set the keys of your dict as the index.
df = pd.concat(sheets_dict.values(), keys=sheets_dict.keys())
By default, pd.concat(sheets_dict) does the same thing: the dict keys become the outer level of the index.
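If you'd rather have the sheet name as a regular column (which is what the df['extract'] = k attempt was going for), you can reset that key level out of the index. A sketch:

combined = pd.concat(sheets_dict)  # dict keys become the outer index level
combined = combined.reset_index(level=0).rename(columns={'level_0': 'extract'})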

Iteration through dictionary pulling the wrong value for searched key

I've used Pandas to read an Excel sheet that has two columns, which are used to create a key-value dictionary. When run, the code will search for a key and produce its value. Ex: WSO-Exchange will be equal to 52206.
However, when I search for 59904-FX's value, it returns 35444 when I need it to return 22035; it only throws this issue when a key is also a value later on. Any ideas on how I can fix this error? I'll attach my code below, thanks!
MapDatabase = {}
for i in Mapdf.index:
    MapDatabase[Mapdf['General Code'][i]] = Mapdf['Upload Code'][i]
df["AccountID"][i] is reading another excel sheet to search if that cell is in the dictionary's keys, and if it is, then to change it to it's value.
for i in df.index:
    for key, value in MapDatabase.items():
        if str(df['AccountId'][i]) == str(key):
            df['AccountId'][i] = value
I would just use the xlrd library to do this:
from xlrd import open_workbook
workbook = open_workbook("data.xlsx")
sheet = workbook.sheet_by_index(0)
data = {sheet.cell(row, 0).value: sheet.cell(row, 1).value for row in range(sheet.nrows)}
print(data)
Which gives the following dictionary:
{'General Code': 'Upload Code', '59904-FX': 22035.0, 'WSO-Exchange': 52206.0, 56476.0: 99875.0, 22035.0: 35444.0}
Check for Chained Replacements in Your Loop
The symptom you describe (a lookup going wrong exactly when a key is also a value) points to the inner loop continuing after a match: once df['AccountId'][i] has been replaced with 22035, a later iteration of MapDatabase.items() matches the key 22035 and replaces it again with 35444. Breaking out of the inner loop after the assignment avoids this. It is also worth checking that the first column of the Excel file you use to build Mapdf is unique per row, since duplicate keys overwrite one another in the dictionary.
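For example, a break after the assignment stops the chained replacement (a sketch of the loop from the question):

for i in df.index:
    for key, value in MapDatabase.items():
        if str(df['AccountId'][i]) == str(key):
            df.at[i, 'AccountId'] = value
            break  # without this, the new value 22035 later matches the key 22035 -> 35444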
Don't Iterate Over DataFrame Rows
However, rather than trying to iterate row-wise over a DataFrame (which is almost always the wrong thing to do), you can build the dictionary with a call to the dict constructor, handing it an iterable of (key, value) pairs:
MapDatabase = dict(zip(Mapdf["General Code"], Mapdf["Upload Code"]))
Consider Merging Rather than Looping
Better yet, what you are doing seems like an ideal candidate for DataFrame.merge.
It looks like what you want to do is overwrite the values of AccountId in df with the values of Upload Code in Mapdf if AccountId has a match in General Code in Mapdf. That's a mouthful, but let's break it down.
First, merge Mapdf onto df by the matching columns (df["AccountId"] to Mapdf["General Code"]):
columns = ["General Code", "Upload Code"] # only because it looks like there are more columns you don't care about
merged = df.merge(Mapdf[columns], how="left", left_on = "AccountId", right_on="General Code")
Because this is a left join, rows in merged where column AccountId does not have a match in Mapdf["General Code"] will have missing values for Upload Code. Copy the non-missing values to overwrite AccountId:
matched = merged["Upload Code"].notnull()
merged.loc[matched, "AccountId"] = merged.loc[matched, "Upload Code"]
Then drop the extra columns if you'd like:
merged.drop(["Upload Code", "General Code"], axis="columns", inplace=True)
EDIT: Turns out I didn't need to do a nested for loop. The solution was to go from the inner for loop to an if statement:
for i in df.index:
    if df['AccountId'][i] in MapDatabase:
        df.at[i, 'AccountId'] = MapDatabase[df['AccountId'][i]]
