I am reading an Excel file with around 2000 sheets using pandas. The workbook is loaded as an ordered dictionary of DataFrames, since I have used the following:
test = pd.read_excel('test.xlsx', sheet_name=None)
Let's assume it looks like this: the keys are sheet names such as 'General', 'A.AA.XX2020', 'A.AA.XX2022', and so on, each mapping to that sheet's DataFrame.
I would like to modify the name of the sheets and save the Ordered Dictionary to an excel file again. The names of the sheets are stored as the keys of the Ordered Dictionary, so basically I would like to just modify the keys and save to excel again.
As can be noticed, each key name ends with a year, i.e. 2020, 2022, etc. I would like all the keys to be modified such that the year is reduced by 1, so the key names now end in 2019, 2021, etc. I would also like to make sure that the content does not change, meaning that the DataFrame that used to be assigned to A.AA.XX2020 is now assigned to A.AA.XX2019. The 'General' sheet does not have to be modified.
Since there are many sheets in the excel file, I would prefer an automated procedure.
I hope the following will suit your needs:
import pandas as pd

# read Excel file into a dict of DataFrames keyed by sheet name
test = pd.read_excel('test.xlsx', sheet_name=None)

# get keys from dict without 'General'
keys = list(test.keys())
keys.remove('General')

# iterate over keys
for key in keys:
    # get the old year
    year_old = int(key[-4:])
    # make the new year
    year_new = year_old - 1
    # create name for new key
    key_new = key[:-4] + str(year_new)
    # copy values from old key to new key and delete old key
    test[key_new] = test.pop(key)

# write dataframes from dict to one Excel file with new sheet names
with pd.ExcelWriter('test_new.xlsx') as writer:
    for sheet_name, df in test.items():
        df.to_excel(writer, sheet_name=sheet_name)
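If some sheet names might not end in a four-digit year, a slightly more defensive variant (just a sketch, assuming the same test dict as above) renames only the keys that actually match, and builds a fresh dict so a renamed key cannot collide with one that has not been processed yet:

import re

# keep only keys ending in a four-digit year for renaming; everything
# else (e.g. 'General') is carried over unchanged
renamed = {}
for key, df in test.items():
    match = re.search(r'(\d{4})$', key)
    if match:
        key = key[:-4] + str(int(match.group(1)) - 1)
    renamed[key] = df
test = renamed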
Related
I'm using the pandas library in Python.
I've taken an Excel file and stored the contents in a data frame by doing the following:
path = r"filepath"
sheets_dict = pd.read_excel(path,sheet_name=None)
As there were multiple sheets, each containing a table of data with identical columns, I used pd.read_excel(path, sheet_name=None). This stored all the individual sheets in a dictionary, with the key for each value/sheet being the sheet name.
I now want to unpack the dictionary and place each sheet into a single DataFrame. I want to use the key of each sheet in the dictionary either as part of a MultiIndex, so I know which key/sheet each table came from, or appended as a new column that gives me the key/sheet name for each unique subset of the DataFrame.
I've tried the following:
for k, df in sheets_dict.items():
    df = pd.concat([pd.DataFrame(df)])
    df['extract'] = k
However I'm not getting the results I want.
Any suggestions?
You can use the keys argument in pd.concat, which will set the keys of your dict as the outer level of the index:
df = pd.concat(sheets_dict.values(), keys=sheets_dict.keys())
By default, pd.concat(sheets_dict) will likewise use the dict keys as the index.
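If you would rather have the sheet name as a regular column instead of an index level (the other option mentioned in the question), a minimal sketch, assuming the same sheets_dict and that each sheet has a plain single-level index, is:

df = pd.concat(sheets_dict)           # outer index level is the sheet name
df.index.names = ['extract', None]    # name the outer level
df = df.reset_index(level='extract')  # move it out into a column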
I have one Excel file with several identically structured sheets (same headers and number of columns) (sheet names: 01, 02, ..., 12).
How can I get this into one dataframe?
Right now I would load them all separately with:
df1 = pd.read_excel('path.xls', sheet_name='01')
df2 = pd.read_excel('path.xls', sheet_name='02')
...
and would then concatenate them.
What is the most pythonic way to do this and get one DataFrame with all the sheets directly? Also assuming I do not know every sheet name in advance.
Read the file as:
collection = pd.read_excel('path.xls', sheet_name=None)

combined = pd.concat([value.assign(sheet_source=key)
                      for key, value in collection.items()],
                     ignore_index=True)
sheet_name=None ensures all the sheets are read in.
collection is a dictionary, with the sheet name as key and the actual data as the value. combined uses the pandas concat method to get you one DataFrame. I added the extra column sheet_source in case you need to track where the data for each row comes from.
You can read more about it in the pandas documentation.
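As a quick usage example (assuming the sheet names 01, 02, ... from the question), the rows that came from one particular sheet can be pulled back out by filtering on that column:

# rows of combined that originally came from sheet '01'
first_sheet = combined[combined['sheet_source'] == '01']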
You can use:
df_final = pd.concat([pd.read_excel('path.xls', sheet_name="{:02d}".format(sheet)) for sheet in range(1, 13)], axis=0)
Note that range(1, 13) produces the sheet names "01" through "12", so this only works when the sheet names are known in advance.
I have an Excel file with multiple sheets that I convert into a dictionary of DataFrames, where the key represents the sheet's name:
xl = pd.ExcelFile(r"D:\Python Code\PerformanceTable.xlsx")
pacdict = {name: pd.read_excel(xl, sheet_name=name) for name in xl.sheet_names}
I would like to replace this input Excel file with a flat text file -- but would still like to end up with the same outcome of a dictionary of dataframes.
Any suggestions on how I might be able to format the text file so it still contains data for multiple, named tables/sheets and can be read back into the above format? Preferably still making pandas' built-in functionality do the heavy lifting.
Loop through each sheet. Create a new column called "sheet_source". Concatenate the sheet DataFrames to a master DataFrame. Lastly, export to a CSV file.
# create a master dataframe to store the sheets
df_master = pd.DataFrame()

# loop through each dict key
for each_df_key in pacdict.keys():
    # dataframe for each sheet
    sheet_df = pacdict[each_df_key]
    # add a column for the sheet name
    sheet_df['sheet_source'] = each_df_key
    # concatenate each sheet to the master
    df_master = pd.concat([df_master, sheet_df])

# after the for-loop, export the master dataframe to CSV
df_master.to_csv('new_dataframe.csv', index=False)
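To get back to the original dictionary-of-DataFrames shape from that flat file, a minimal sketch (assuming the new_dataframe.csv and sheet_source column produced above) is to group on the sheet column:

import pandas as pd

# read the flat file and split it back into one DataFrame per sheet,
# keyed by the original sheet name
df_all = pd.read_csv('new_dataframe.csv')
pacdict = {name: group.drop(columns='sheet_source').reset_index(drop=True)
           for name, group in df_all.groupby('sheet_source')}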
I've used pandas to read an Excel sheet that has two columns used to create a key/value dictionary. When run, the code will search for a key and produce its value. Ex: WSO-Exchange will be equal to 52206.
However, when I search for 59904-FX's value, it returns 35444 when I need it to return 22035; it only throws this issue when a key is also a value later on. Any ideas on how I can fix this error? I'll attach my code below, thanks!
MapDatabase = {}
for i in Mapdf.index:
    MapDatabase[Mapdf['General Code'][i]] = Mapdf['Upload Code'][i]
df["AccountID"][i] is reading another excel sheet to search if that cell is in the dictionary's keys, and if it is, then to change it to it's value.
for i in df.index:
    for key, value in MapDatabase.items():
        if str(df['AccountId'][i]) == str(key):
            df['AccountId'][i] = value
I would just use the xlrd library to do this (note that xlrd 2.0 and later dropped .xlsx support, so this requires an older xlrd or an .xls file):
from xlrd import open_workbook
workbook = open_workbook("data.xlsx")
sheet = workbook.sheet_by_index(0)
data = {sheet.cell(row, 0).value: sheet.cell(row, 1).value for row in range(sheet.nrows)}
print(data)
Which gives the following dictionary:
{'General Code': 'Upload Code', '59904-FX': 22035.0, 'WSO-Exchange': 52206.0, 56476.0: 99875.0, 22035.0: 35444.0}
Check the Index for Duplicates in Your Excel File
Most likely the problem is that you are iterating over a non-unique index of the DataFrame Mapdf. Check that the first column in the Excel file you are using to build Mapdf is unique per row.
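A quick way to check for such duplicates (a sketch, assuming the same Mapdf) is:

# number of rows whose 'General Code' repeats an earlier row; anything
# above zero means later rows silently overwrite earlier dict entries
print(Mapdf['General Code'].duplicated().sum())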
Don't Iterate Over DataFrame Rows
However, rather than trying to iterate row-wise over a DataFrame (which is almost always the wrong thing to do), you can build the dictionary with a call to the dict constructor, handing it an iterable of (key, value) pairs:
MapDatabase = dict(zip(Mapdf["General Code"], Mapdf["Upload Code"]))
Consider Merging Rather than Looping
Better yet, what you are doing seems like an ideal candidate for DataFrame.merge.
It looks like what you want to do is overwrite the values of AccountId in df with the values of Upload Code in Mapdf if AccountId has a match in General Code in Mapdf. That's a mouthful, but let's break it down.
First, merge Mapdf onto df by the matching columns (df["AccountId"] to Mapdf["General Code"]):
columns = ["General Code", "Upload Code"] # only because it looks like there are more columns you don't care about
merged = df.merge(Mapdf[columns], how="left", left_on = "AccountId", right_on="General Code")
Because this is a left join, rows in merged where column AccountId does not have a match in Mapdf["General Code"] will have missing values for Upload Code. Copy the non-missing values to overwrite AccountId:
matched = merged["Upload Code"].notnull()
merged.loc[matched, "AccountId"] = merged.loc[matched, "Upload Code"]
Then drop the extra columns if you'd like:
merged.drop(["Upload Code", "General Code"], axis="columns", inplace=True)
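A more compact route to the same overwrite (just a sketch, assuming the same df and the MapDatabase dict built above) is Series.map with a fallback:

# map each AccountId through the dict; ids with no match come back as
# NaN, so fall back to the original value
df['AccountId'] = df['AccountId'].map(MapDatabase).fillna(df['AccountId'])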
EDIT: It turns out I didn't need to do a nested for loop. The solution was to go from a for loop to an if statement.
for i in df.index:
    if str(df['AccountId'][i]) in str(MapDatabase.items()):
        df.at[i, 'AccountId'] = MapDatabase[df['AccountId'][i]]
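For what it's worth, testing membership against the dict itself avoids the fragile substring match that str(MapDatabase.items()) performs (a sketch, assuming the same df and MapDatabase):

for i in df.index:
    key = df['AccountId'][i]
    # look the key up in the dict directly instead of in the string
    # form of all its items
    if key in MapDatabase:
        df.at[i, 'AccountId'] = MapDatabase[key]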
I have a DataFrame with 101 columns and I want to look at the distribution of each variable. Using pandas value_counts, I have created a dictionary of Series of various lengths. Each Series has its own key.
First I do:
out = {}
for c in df.columns:
    out[c] = df[c].value_counts(dropna=False).fillna(0)
So, out is a dictionary of size 101; its values are Series of various sizes.
Key | Type | Size | Value
Key1 Series (12,) class 'pandas.core.series.Series'
Key2 Series (7,) class 'pandas.core.series.Series'
Key3 Series (24,) class 'pandas.core.series.Series'
.
.
.
Key101
Each Key is unique. I want to save each of these series to an Excel file. This answer is close and will work for the first key in the loop, but it won't continue on to the next key in the dictionary. This is what I have now:
for key in out.keys():
    s = out[key]
    name = key[:30]
    s.to_excel('xlfile.xlsx', sheet_name=name)
I only keep the first 30 characters because Excel limits sheet names to 31 characters. I don't necessarily need them to have their own sheets; I would rather they all be saved to a single sheet by column, but this is the closest I can get to saving them. Obviously a newbie, so if there is a better approach to my fundamental question, that would be awesome too.
I'm open to any suggestions, thanks for your time.
Need ExcelWriter:
with pd.ExcelWriter('xlfile.xlsx') as writer:
    for key in out.keys():
        out[key].to_excel(writer, sheet_name=key[:30])
The context manager saves the file on exit (ExcelWriter.save() was removed in pandas 2.0; use close() or a with block).
Or, if creating the dict is not necessary, write the sheets in the first loop (an added advantage is that the sheet order matches the DataFrame's column order):
with pd.ExcelWriter('xlfile.xlsx') as writer:
    for c in df.columns:
        df[c].value_counts(dropna=False).fillna(0).to_excel(writer, sheet_name=c[:30])
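Since the question also mentions preferring everything on a single sheet, one column per variable, a minimal sketch (assuming the same out dict; the output filename here is just illustrative) is to line the Series up side by side first:

# concatenate the value-counts Series along the columns: rows are the
# distinct values, columns are the original column names, and
# combinations that do not occur become NaN
single = pd.concat(out, axis=1)
single.to_excel('xlfile_single_sheet.xlsx')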
You need to create an Excel writer and pass that into the loop, i.e.:
>>> with pd.ExcelWriter('output.xlsx') as writer:
...     df1.to_excel(writer, sheet_name='Sheet1')
...     df2.to_excel(writer, sheet_name='Sheet2')
Refer to the pandas ExcelWriter documentation for more.