I have existing code that uses pandas to read an Excel workbook.
It reads all the sheets, including the hidden ones, but I only want to read the visible ones.
Below are the snippets I tried, but they didn't help:
xlFile = '/path/file_name.xlsx'
xl = pd.ExcelFile(xlFile)
list_of_visible_sheets = []
sheets = xl.book.sheets()
for sheet in sheets:
    if sheet.visibility == 0:
        list_of_visible_sheets.append(sheet)
print(list_of_visible_sheets)
and
list_of_visible_sheets = []
sheets = xl.sheet_names
for sheet in sheets:
    if sheet.visibility == 0:
        list_of_visible_sheets.append(sheet)
print(list_of_visible_sheets)
How can I get the visible sheets alone?
You can use this code with openpyxl. It roughly does what the pandas read_excel() function does:
import openpyxl
import pandas as pd

filename = 'TestFile.xlsx'

# Number of rows to skip in each sheet
nSkip = 1

dfs = {ws.title: pd.DataFrame(list(ws.values)[nSkip:])
       for ws in openpyxl.load_workbook(filename, read_only=True).worksheets
       if ws.sheet_state != 'hidden'}

print(dfs)
Try passing the sheetname argument (sheet_name in newer pandas versions) to pandas.read_excel.
If there are not too many sheets, you can build the desired list manually, or use the recipe from that answer together with a lambda.
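For example, here is a minimal sketch that combines both ideas, assuming an .xlsx file at the placeholder path from the question: filter the visible sheet names with openpyxl, then hand that list to pandas.
import openpyxl
import pandas as pd

xlFile = '/path/file_name.xlsx'  # placeholder path

# Collect only the visible sheet names.
wb = openpyxl.load_workbook(xlFile, read_only=True)
visible = [ws.title for ws in wb.worksheets if ws.sheet_state == 'visible']

# Passing a list to sheet_name returns a dict of {sheet name: DataFrame}
# containing only those sheets.
dfs = pd.read_excel(xlFile, sheet_name=visible)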
I am trying to merge multiple .xls files that each have many columns, but one column with hyperlinks. I have tried to do this with Python but keep running into errors I cannot solve.
To be precise, the hyperlinks are hidden behind display text. The following ctrl-click hyperlink is an example of what I encounter in the .xls files: ES2866911 (T3).
To improve reproducibility, I have added .xls1 and .xls2 samples below.
xls1:

Title    Publication_Number
P_A      ES2866911 (T3)
P_B      EP3887362 (A1)
.xls2:

Title    Publication_Number
P_C      AR118706 (A2)
P_D      ES2867600 (T3)
Desired outcome:

Title    Publication_Number
P_A      ES2866911 (T3)
P_B      EP3887362 (A1)
P_C      AR118706 (A2)
P_D      ES2867600 (T3)
I am unable to get the .xls files into Python without losing formatting or losing the hyperlinks. I am also unable to convert the .xls files to .xlsx, and I have no way to acquire the .xls files in .xlsx format. Below I briefly summarize what I have tried:
1.) Reading with pandas was my first attempt. Easy to do, but all hyperlinks are lost in pandas, and all formatting from the original file is lost as well.
2.) Reading the .xls files with openpyxl.load_workbook gives:
InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.
3.) Converting .xls files to .xlsx
from xls2xlsx import XLS2XLSX
x2x = XLS2XLSX('input_file.xls')
wb = x2x.to_xlsx()
x2x.to_xlsx('output_file.xlsx')
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
import pyexcel as p
p.save_book_as(file_name='input_file.xls', dest_file_name='export_file.xlsx')
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
During handling of the above exception, another exception occurred:
StopIteration
4.) Even if I am able to read the .xls file with xlrd, for example (meaning I will never be able to save the file as .xlsx), I can't even see the hyperlink:
import xlrd
wb = xlrd.open_workbook(file) # where vis.xls is your test file
ws = wb.sheet_by_name('Sheet1')
ws.cell(5, 1).value
'AR118706 (A2)' #Which is the name, not hyperlink
5.) I tried installing older versions of openpyxl (e.g. openpyxl==3.0.1) to overcome the TypeError, with no success. I tried to open the .xls file with openpyxl using the xlrd engine, and a similar 'xml.etree.ElementTree.Element' TypeError occurred. I tried many ways to batch convert .xls files to .xlsx, all with similar errors.
Obviously I could just open each file with Excel and save it as .xlsx, but that defeats the entire purpose, and I can't do that for hundreds of files.
You need to use the xlrd library to read the hyperlinks properly, pandas to merge all the data together, and XlsxWriter to write the data out properly.
Assuming all input files have the same format, you can use the code below.
# imports
import os

import pandas as pd
import xlrd
import xlsxwriter


# required functions
def load_excel_to_df(filepath, hyperlink_col):
    book = xlrd.open_workbook(filepath)
    sheet = book.sheet_by_index(0)
    hyperlink_map = sheet.hyperlink_map
    data = pd.read_excel(filepath)
    hyperlink_col_index = list(data.columns).index(hyperlink_col)
    required_links = [v.url_or_path for k, v in hyperlink_map.items() if k[1] == hyperlink_col_index]
    data['hyperlinks'] = required_links
    return data


# main code
# set required variables
input_data_dir = 'path/to/input/data/'
hyperlink_col = 'Publication_Number'
output_data_dir = 'path/to/output/data/'
output_filename = 'combined_data.xlsx'

# read and combine data
required_files = os.listdir(input_data_dir)
combined_data = pd.DataFrame()
for file in required_files:
    curr_data = load_excel_to_df(input_data_dir + os.sep + file, hyperlink_col)
    combined_data = pd.concat([combined_data, curr_data], sort=False, ignore_index=True)

cols = list(combined_data.columns)
m, n = combined_data.shape
hyperlink_col_index = cols.index(hyperlink_col)

# writing data
writer = pd.ExcelWriter(output_data_dir + os.sep + output_filename, engine='xlsxwriter')
combined_data[cols[:-1]].to_excel(writer, index=False, startrow=1, header=False)  # last column contains hyperlinks
workbook = writer.book
worksheet = writer.sheets[list(workbook.sheetnames.keys())[0]]

# write the header row manually
for i, col in enumerate(cols[:-1]):
    worksheet.write(0, i, col)

# replace the plain text in the hyperlink column with real hyperlinks
for i in range(m):
    worksheet.write_url(i + 1, hyperlink_col_index, combined_data.loc[i, cols[-1]],
                        string=combined_data.loc[i, hyperlink_col])

writer.close()
References:
reading hyperlinks - https://stackoverflow.com/a/7057076/17256762
pandas to_excel header formatting - Remove default formatting in header when converting pandas DataFrame to excel sheet
writing hyperlinks with xlsxwriter - https://xlsxwriter.readthedocs.io/example_hyperlink.html
Without a clear reproducible example, the problem is not clear. Assume I have two files called tmp.xls and tmp2.xls containing dummy data as in the two screenshots below.
Then pandas can easily load, concatenate, and convert to .xlsx format without loss of hyperlinks. Here is some demo code and the resulting file:
import pandas as pd
f1 = pd.read_excel('tmp.xls')
f2 = pd.read_excel('tmp2.xls')
f3 = pd.concat([f1, f2], ignore_index=True)
f3.to_excel('./f3.xlsx')
Inspired by #Kunal, I managed to write code that avoids using the pandas library. The .xls files are read by xlrd and written to a new Excel file by xlwt. Hyperlinks are maintained, and the output file is saved in .xlsx format:
import os

import xlwt
from xlrd import open_workbook

# read and combine data
directory = "random_directory"
required_files = os.listdir(directory)

# Define new file and sheet to get files into
new_file = xlwt.Workbook(encoding='utf-8', style_compression=0)
new_sheet = new_file.add_sheet('Sheet1', cell_overwrite_ok=True)

# Initialize header row, can be done with any file
old_file = open_workbook(directory + "/" + required_files[0], formatting_info=True)
old_sheet = old_file.sheet_by_index(0)
for column in range(0, old_sheet.ncols):
    new_sheet.write(0, column, old_sheet.cell(0, column).value)  # To create header row

# Add rows from all files present in folder
for file in required_files:
    old_file = open_workbook(directory + "/" + file, formatting_info=True)
    old_sheet = old_file.sheet_by_index(0)  # Define old sheet
    hyperlink_map = old_sheet.hyperlink_map  # Create map of all hyperlinks
    for row in range(1, old_sheet.nrows):  # We need all rows except the header row
        if row - 1 < len(hyperlink_map.items()):  # Ensure we do not go out of range on the lower side of hyperlink_map.items()
            Row_depth = len(new_sheet._Worksheet__rows)  # We need the row depth to know where to add the new row
            for col in range(old_sheet.ncols):  # For every column we need to add a row cell
                if col == 1:  # Make an exception for column 2, which is the hyperlinked column
                    click = list(hyperlink_map.items())[row - 1][1].url_or_path  # Define the URL
                    new_sheet.write(Row_depth, col, xlwt.Formula('HYPERLINK("{}", "{}")'.format(click, old_sheet.cell(row, 1).value)))
                else:  # If not the hyperlinked column
                    new_sheet.write(Row_depth, col, old_sheet.cell(row, col).value)  # Write the cell
new_file.save("random_directory/output_file.xlsx")
I assume the same as daedalus in terms of the Excel files. Instead of pandas, I use openpyxl to read the files and create a new Excel file.
import openpyxl

wb1 = openpyxl.load_workbook('tmp.xlsx')
ws1 = wb1['Sheet1']
wb2 = openpyxl.load_workbook('tmp2.xlsx')
ws2 = wb2['Sheet1']

hyperlink_dict = {}

# Go through the first sheet to collect the keys, hyperlink targets and display values
# (starting at row 2 to skip the header row).
for row in range(2, ws1.max_row + 1):
    hyperlink_dict[ws1.cell(row=row, column=1).value] = [
        ws1.cell(row=row, column=2).hyperlink.target,
        ws1.cell(row=row, column=2).value]

# Go through the second sheet to collect the keys, hyperlink targets and display values.
for row in range(2, ws2.max_row + 1):
    hyperlink_dict[ws2.cell(row=row, column=1).value] = [
        ws2.cell(row=row, column=2).hyperlink.target,
        ws2.cell(row=row, column=2).value]
Now you have all the data, so you can create a new workbook and save the values from the dict into it via openpyxl.
from openpyxl import Workbook

wb = Workbook(write_only=True)
ws = wb.create_sheet()
for key, (target, text) in hyperlink_dict.items():
    # Use ws.append() to add each row; rebuild the hyperlink as an Excel formula.
    ws.append([key, '=HYPERLINK("{}", "{}")'.format(target, text)])
wb.save('new_big_file.xlsx')
https://openpyxl.readthedocs.io/en/stable/optimized.html#write-only-mode
I need to process a number of Excel files with different numbers of tabs and different names. I'm creating a function that loads each file with pandas, loops over the sheets, and then returns a data frame.
def process_file(file_name):
    # just junk code - will use pandas
    for sheet_name in file_name:
        sheet_x = sheet_name
    return sheet_x

sheet_1, sheet_2 = process_file(excel_file)
Because there is an unknown number of sheets in each file, creating a variable for each one seems manual. If I wanted to return each sheet, whether there are 2 or 10 sheets, is there a way to do that without naming each one?
Use a list to store all of your sheets:
def process_file(file_name):
    sheets = []
    # just junk code - will use pandas
    for sheet_name in file_name:
        sheet_x = sheet_name
        sheets.append(sheet_x)
    return sheets

sheets_to_process = []
for excel_file in files:
    sheets_to_process += process_file(excel_file)
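Since pandas will be doing the actual reading, here is a minimal sketch along the same lines; the file paths are placeholders, and the workbooks are assumed to be readable by pd.read_excel.
import pandas as pd

files = ['book1.xlsx', 'book2.xlsx']  # placeholder paths

sheets_to_process = []
for excel_file in files:
    # sheet_name=None returns a dict of {sheet name: DataFrame},
    # so no sheet ever needs its own variable name.
    all_sheets = pd.read_excel(excel_file, sheet_name=None)
    sheets_to_process.extend(all_sheets.values())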
I'm new to Python (and programming in general) and am running into a problem when writing data out to sheets in Excel.
I'm reading in an Excel file, performing a sum calculation on specific columns, and then writing the results out to a new workbook. Then at the end, it creates two charts based on the results.
The code works, except every time I run it, it creates new sheets with numbers appended to the end. I really just want it to overwrite the sheet names I provide, instead of creating new ones.
I'm not familiar enough with all the modules to understand the options that are available. I've researched openpyxl and pandas, but examples similar to what I'm trying to do either aren't easy to find or don't seem to work when I try them.
import pandas as pd
import xlrd
import openpyxl as op
from openpyxl import load_workbook
import matplotlib.pyplot as plt
# declare the input file
input_file = 'TestData.xlsx'
# declare the output_file name to be written to
output_file = 'TestData_Output.xlsx'
book = load_workbook(output_file)
writer = pd.ExcelWriter(output_file, engine='openpyxl')
writer.book = book
# read the source Excel file and calculate sums
excel_file = pd.read_excel(input_file)
num_events_main = excel_file.groupby(['Column1']).sum()
num_events_type = excel_file.groupby(['Column2']).sum()
# create dataframes and write names and sums out to new workbook/sheets
df_1 = pd.DataFrame(num_events_main)
df_2 = pd.DataFrame(num_events_type)
df_1.to_excel(writer, sheet_name = 'TestSheet1')
df_2.to_excel(writer, sheet_name = 'TestSheet2')
# save and close
writer.save()
writer.close()
# dataframe for the first sheet
df = pd.read_excel(output_file, sheet_name='TestSheet1')
values = df[['Column1', 'Column3']]
# dataframe for the second sheet
df = pd.read_excel(output_file, sheet_name='TestSheet2')
values_2 = df[['Column2', 'Column3']]
# create the graphs
events_graph = values.plot.bar(x = 'Column1', y = 'Column3', rot = 60) # rot = rotation
type_graph = values_2.plot.bar(x = 'Column2', y = 'Column3', rot = 60) # rot = rotation
plt.show()
I get the expected results, and the charts work fine. I'd really just like to get the sheets to overwrite with each run.
From the pd.DataFrame.to_excel documentation:
Multiple sheets may be written to by specifying unique sheet_name.
With all data written to the file it is necessary to save the changes.
Note that creating an ExcelWriter object with a file name that already
exists will result in the contents of the existing file being erased.
Try writing to the book like this:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3],'col2':[4,5,6]})
writer = pd.ExcelWriter('g.xlsx')
df.to_excel(writer, sheet_name = 'first_df')
df.to_excel(writer, sheet_name = 'second_df')
writer.save()
If you inspect the workbook, you will have two worksheets.
Then let's say you want to write new data to the same workbook:
writer = pd.ExcelWriter('g.xlsx')
df.to_excel(writer, sheet_name = 'new_df')
writer.save()
If you inspect the workbook now, you will just have one worksheet named new_df.
If there are other worksheets in the Excel file that you want to keep, and you only want to overwrite the desired worksheets, you would need to use load_workbook.
Before you write any data, you could delete the sheets you want to write to with:
std = book.get_sheet_by_name(<sheet_name>)
book.remove_sheet(std)
That will stop the behavior where a number gets appended to the worksheet name once you attempt to write a workbook with a duplicate sheet name.
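In newer pandas versions (1.3 and later) there is also a shortcut that skips the manual sheet removal: open the writer in append mode with if_sheet_exists='replace'. Here is a minimal sketch using the file and sheet names from the question, where df_1 and df_2 are the DataFrames built there.
import pandas as pd

# Sheets named 'TestSheet1'/'TestSheet2' are replaced in place instead of
# getting a number appended to their names.
with pd.ExcelWriter('TestData_Output.xlsx', engine='openpyxl',
                    mode='a', if_sheet_exists='replace') as writer:
    df_1.to_excel(writer, sheet_name='TestSheet1')
    df_2.to_excel(writer, sheet_name='TestSheet2')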
I am trying to read different worksheets from an Excel workbook in Python with pandas. When I read the entire workbook and then want to apply .merge(), only the first worksheet is read and the others are not considered. I also tried to read each worksheet of the workbook individually, but I guess they were not successfully converted to data frames, because when I apply .merge() I end up with the following error: ValueError: Invalid file path or buffer object type: <class 'pandas.core.frame.DataFrame'>
This is what I have done so far:
This code works for converting the entire workbook to a data frame, but only the data of the first worksheet is processed:
import pandas as pd
import pypyodbc
from datetime import date

#sql extractor
start_date = date.today()
retrieve_values = "[DEV].[CS].[QT_KPIExport] #start_date='{start_date:%Y-%m-%d}'".format(
    start_date=start_date)
connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="X", pwd="xxx", Trusted_Connection="No")
data_frame_sql = pd.read_sql(retrieve_values, connection)
#Read the entire workbook
wb_data = pd.ExcelFile("C:\\Users\\Dev\\Testing\\Daily_Data\\NSN-Daily Data Report.xlsx")
#Convert to a dataframe the entire workbook
data_frame_excel = pd.read_excel(wb_data,index_col=None,na_values=['NA'],parse_cols="J")
#apply merge
merged_df = data_frame_sql.merge(data_frame_excel,how="inner",on="sectorname")
This code tries to read the different worksheets and convert them to data frames with no success...yet! (check the answer below)
data_frame_sql = pd.read_sql(retrieve_values, connection)
#Method 1: Tried to parse worksheet 2
#Read the entire workbook and select the specific worksheet
wb_data = pd.ExcelFile("C:\\Users\\Dev\\Testing\\Daily_Data\\NSN-Daily Data Report.xlsx", sheetname="Sheet-2")
data_frame_excel = pd.read_excel(wb_data,index_col=None,na_values=['NA'],parse_cols="J")
#apply merge
merged_df = data_frame_sql.merge(data_frame_excel,how="inner",on="sectorname")
#No success... the data of the first sheet is read
#Method 2: Tried to parse worksheet 2
#Read the entire workbook
wb_data = pd.ExcelFile("C:\\Users\\Dev\\Testing\\Daily_Data\\NSN-Daily Data Report.xlsx")
data_frame_excel = pd.read_excel(wb_data,index_col=None,na_values=['NA'],parse_cols="J")
#select one specific sheet
ws_sheet_2 = wb_data.parse("Sheet-2")
#apply merge
merged_df = data_frame_sql.merge(ws_sheet_2,how="inner",on="sectorname")
# No success.... ValueError: Invalid file path or buffer object type: <class 'pandas.core.frame.DataFrame'>
Any help or advice is greatly appreciated.
You can get all worksheets from a workbook into a dictionary by using the sheetname=None argument with the read_excel method. Key/value pairs will be ws name/dataframe.
ws_dict = pd.read_excel('excel_file.xlsx', sheetname=None)
Note the sheetname argument will change to sheet_name in future pandas versions...
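For example, here is a minimal sketch using the sheet and column names from the question (with the newer sheet_name spelling); data_frame_sql is the SQL result built earlier in the question.
# ws_dict maps each sheet name to its own DataFrame.
ws_dict = pd.read_excel("C:\\Users\\Dev\\Testing\\Daily_Data\\NSN-Daily Data Report.xlsx",
                        sheet_name=None)

# Pick the sheet you need and merge it as usual.
sheet_2 = ws_dict["Sheet-2"]
merged_df = data_frame_sql.merge(sheet_2, how="inner", on="sectorname")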
I found a solution that did the trick.
#Method 1: Add the sheetname once you have read the entire workbook
#Read the entire workbook
wb_data = pd.ExcelFile("C:\\Users\\Dev\\Testing\\Daily_Data\\NSN-Daily Data Report.xlsx")
#Select your sheetname to read
data_frame_excel = pd.read_excel(wb_data, index_col=None, na_values=['NA'], parse_cols="J", sheetname="Sheet-2")
#apply merge
merged_df = data_frame_sql.merge(data_frame_excel, how="inner", on="sectorname")
To read .xlsx files in Pandas, for a document with multiple sheets, specify the sheet name and use a different engine.
Step 1 (install the openpyxl package):
! pip install openpyxl
Step 2 (use the openpyxl engine):
data_df = pd.read_excel(<ARCHIVE_PATH>, sheet_name=<sheet_name>, engine='openpyxl')
Here is the official documentation.
Another solution using openpyxl directly:
from openpyxl import load_workbook

wb = load_workbook(ARCHIVE_PATH)
ws = wb[<sheet-name>]
data_df = pd.DataFrame(ws.values)

# Or, to use the first row as the column headers:
df_tm = ws.values
coluna_tm = next(df_tm)[0:]
df = pd.DataFrame(df_tm, columns=coluna_tm)
I am using Python 3.4 and xlrd. I want to sort the Excel sheet based on the primary column before processing it. Is there any library to perform this ?
There are a couple of ways to do this. The first option is to utilize xlrd, as you have it tagged. The biggest downside is that it doesn't natively write to the XLSX format.
These examples use an excel document with this format:
Utilizing xlrd and a few modifications from this answer:
import xlwt
from xlrd import open_workbook

target_column = 0  # This example only has 1 column, and it is 0 indexed

book = open_workbook('test.xlsx')
sheet = book.sheets()[0]

data = [sheet.row_values(i) for i in range(sheet.nrows)]
labels = data[0]  # Don't sort our headers
data = data[1:]   # Data begins on the second row

data.sort(key=lambda x: x[target_column])

bk = xlwt.Workbook()
sheet = bk.add_sheet(sheet.name)

for idx, label in enumerate(labels):
    sheet.write(0, idx, label)

for idx_r, row in enumerate(data):
    for idx_c, value in enumerate(row):
        sheet.write(idx_r + 1, idx_c, value)

bk.save('result.xls')  # Notice this is xls, not xlsx like the original file is
This outputs the following workbook:
Another option (and one that can utilize XLSX output) is to utilize pandas. The code is also shorter:
import pandas as pd
xl = pd.ExcelFile("test.xlsx")
df = xl.parse("Sheet1")
df = df.sort(columns="Header Row")
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer,sheet_name='Sheet1',columns=["Header Row"],index=False)
writer.save()
This outputs:
In the to_excel call, the index is set to False so that the pandas DataFrame index isn't included in the Excel document. The rest of the keywords should be self-explanatory.
I just wanted to refresh the answer as the Pandas implementation has changed a bit over time. Here's the code that should work now (pandas 1.1.2).
import pandas as pd
xl = pd.ExcelFile("test.xlsx")
df = xl.parse("Sheet1")
df = df.sort_values(by="Header Row")
...
The sort function is now called sort_values, and the columns argument has been replaced by by.