I need to read in an Excel file and read all the sheets inside it.
I've tried:
sample_df = pd.concat(pd.read_excel("sample_master.xlsx", sheet_name=None), ignore_index=True)
This code worked, but it's suddenly giving me this error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
After reading in the excel file, I need to run the following command:
new_id = sample_df.loc[(sample_df['Sequencing_ID'] == line) & (sample_df['Experiment_ID'] == experiment_id),
                       'Sample_ID_for_report'].item()
Any help?
First, you will want to know all of the sheets that need to be read in. Second, you will want to iterate over each sheet.
Getting sheet names. You can get a list of the sheet names in a workbook with sheets = pd.ExcelFile(path).sheet_names, where path is the full path to your file. The function below reads a workbook and returns a list of sheet names that contain specific keywords.
import re
import pandas as pd

def get_sheets(path):
    """Return the sheet names in the workbook at `path` that match the include terms."""
    sheets = pd.ExcelFile(path).sheet_names
    sheets_to_process = []
    excludes = ['exclude_term1', 'exclude_term2']
    includes = ['find_term1', 'find_term2']
    for sheet in sheets:
        # Normalize the sheet name before matching
        sheet_stnd = re.sub('[^0-9A-Za-z_]+', '', sheet).lower().strip()
        if sheet_stnd in excludes:
            continue
        for include in includes:
            if include in sheet_stnd:
                sheets_to_process.append(sheet)
    return list(set(sheets_to_process))
Loop over sheets. You can then loop over the sheets to read them in. In this example, each sheet is read into its own DataFrame (pd.read_excel returns a single DataFrame when sheet_name is a string, so there is no need to wrap it in pd.concat):
for sheet in get_sheets(path):
    df = pd.read_excel("sample_master.xlsx", sheet_name=sheet)
Depending on your use case, you may also want to append each sheet into a larger data frame, as in the sketch below.
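For example, a minimal sketch (assuming the workbook is sample_master.xlsx and the columns line up across sheets) that collects every matching sheet and concatenates them, which also avoids the TypeError from the question:

import pandas as pd

frames = []
for sheet in get_sheets("sample_master.xlsx"):
    # sheet_name as a string returns a single DataFrame per call
    frames.append(pd.read_excel("sample_master.xlsx", sheet_name=sheet))

# pd.concat expects an iterable of DataFrames, which `frames` is
sample_df = pd.concat(frames, ignore_index=True)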
I am trying to merge multiple .xls files that have many columns, but one column with hyperlinks. I am trying to do this with Python but keep running into errors I cannot solve.
To be concrete, the hyperlinks are hidden behind display text. The following ctrl-click hyperlink is an example of what I encounter in the .xls files: ES2866911 (T3).
In order to improve reproducibility, I have added .xls1 and .xls2 samples below.
xls1:

Title    Publication_Number
P_A      ES2866911 (T3)
P_B      EP3887362 (A1)

xls2:

Title    Publication_Number
P_C      AR118706 (A2)
P_D      ES2867600 (T3)

Desired outcome:

Title    Publication_Number
P_A      ES2866911 (T3)
P_B      EP3887362 (A1)
P_C      AR118706 (A2)
P_D      ES2867600 (T3)
I am unable to get the .xls files into Python without losing formatting or hyperlinks. In addition, I am unable to convert the .xls files to .xlsx, and I have no way to acquire the files in .xlsx format in the first place. Below I briefly summarize what I have tried:
1.) Reading with pandas was my first attempt. Easy to do, but all hyperlinks are lost in pandas; furthermore, all formatting from the original file is lost.
2.) Reading .xls files with openpyxl.load_workbook:
InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.
3.) Converting .xls files to .xlsx:
from xls2xlsx import XLS2XLSX
x2x = XLS2XLSX("input_file.xls")
wb = x2x.to_xlsx()
x2x.to_xlsx('output_file.xlsx')
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
import pyexcel as p
p.save_book_as(file_name="input_file.xls", dest_file_name="export_file.xlsx")
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
During handling of the above exception, another exception occurred:
StopIteration
4.) Even if we are able to read the .xls file with xlrd, for example (meaning we will never be able to save the file as .xlsx), I can't even see the hyperlink:
import xlrd
wb = xlrd.open_workbook(file) # where vis.xls is your test file
ws = wb.sheet_by_name('Sheet1')
ws.cell(5, 1).value
'AR118706 (A2)'  # which is the display text, not the hyperlink
5.) I tried installing an older version of openpyxl (3.0.1) to overcome the type error, with no success. I tried to open the .xls file with openpyxl using the xlrd engine; a similar TypeError ('xml.etree.ElementTree.Element') occurred. I tried many ways to batch convert .xls files to .xlsx, all with similar errors.
Obviously I could just open each file in Excel and save it as .xlsx, but that defeats the entire purpose, and I can't do that for hundreds of files.
You need to use the xlrd library to read the hyperlinks properly, pandas to merge all the data together, and xlsxwriter to write the data properly.
Assuming all input files have the same format, you can use the code below.
# imports
import os
import xlrd
import xlsxwriter
import pandas as pd
# required functions
def load_excel_to_df(file_path, hyperlink_col):
    book = xlrd.open_workbook(file_path)
    sheet = book.sheet_by_index(0)
    hyperlink_map = sheet.hyperlink_map
    data = pd.read_excel(file_path)
    hyperlink_col_index = list(data.columns).index(hyperlink_col)
    required_links = [v.url_or_path for k, v in hyperlink_map.items() if k[1] == hyperlink_col_index]
    data['hyperlinks'] = required_links
    return data

# main code
# set required variables
input_data_dir = 'path/to/input/data/'
hyperlink_col = 'Publication_Number'
output_data_dir = 'path/to/output/data/'
output_filename = 'combined_data.xlsx'

# read and combine data
required_files = os.listdir(input_data_dir)
frames = []
for file in required_files:
    frames.append(load_excel_to_df(input_data_dir + os.sep + file, hyperlink_col))
combined_data = pd.concat(frames, sort=False, ignore_index=True)

cols = list(combined_data.columns)
m, n = combined_data.shape
hyperlink_col_index = cols.index(hyperlink_col)

# writing data
writer = pd.ExcelWriter(output_data_dir + os.sep + output_filename, engine='xlsxwriter')
combined_data[cols[:-1]].to_excel(writer, index=False, startrow=1, header=False)  # last column contains hyperlinks
workbook = writer.book
worksheet = writer.sheets[list(workbook.sheetnames.keys())[0]]

# write the header row manually
for i, col in enumerate(cols[:-1]):
    worksheet.write(0, i, col)

# write the hyperlinked column cell by cell
for i in range(m):
    worksheet.write_url(i + 1, hyperlink_col_index, combined_data.loc[i, cols[-1]],
                        string=combined_data.loc[i, hyperlink_col])

writer.save()
References:
reading hyperlinks - https://stackoverflow.com/a/7057076/17256762
pandas to_excel header formatting - Remove default formatting in header when converting pandas DataFrame to excel sheet
writing hyperlinks with xlsxwriter - https://xlsxwriter.readthedocs.io/example_hyperlink.html
Without a clear reproducible example, the problem is not clear. Assume I have two files called tmp.xls and tmp2.xls containing dummy data similar to the samples above.
Then pandas can easily load, concatenate, and convert to .xlsx format without loss of hyperlinks. Here is some demo code:
import pandas as pd
f1 = pd.read_excel('tmp.xls')
f2 = pd.read_excel('tmp2.xls')
f3 = pd.concat([f1, f2], ignore_index=True)
f3.to_excel('./f3.xlsx')
Inspired by @Kunal, I managed to write code that avoids using the pandas library. The .xls files are read with xlrd and written to a new Excel file with xlwt. Hyperlinks are maintained, and the output file is saved in .xlsx format:
import os
import xlwt
from xlrd import open_workbook

# read and combine data
directory = "random_directory"
required_files = os.listdir(directory)

# Define new file and sheet to get files into
new_file = xlwt.Workbook(encoding='utf-8', style_compression=0)
new_sheet = new_file.add_sheet('Sheet1', cell_overwrite_ok=True)

# Initialize header row, can be done with any file
old_file = open_workbook(directory + "/" + required_files[0], formatting_info=True)
old_sheet = old_file.sheet_by_index(0)
for column in range(old_sheet.ncols):
    new_sheet.write(0, column, old_sheet.cell(0, column).value)  # To create header row

# Add rows from all files present in folder
for file in required_files:
    old_file = open_workbook(directory + "/" + file, formatting_info=True)
    old_sheet = old_file.sheet_by_index(0)  # Define old sheet
    hyperlink_map = old_sheet.hyperlink_map  # Create map of all hyperlinks
    for row in range(1, old_sheet.nrows):  # We need all rows except the header row
        if row - 1 < len(hyperlink_map.items()):  # Make sure we do not go out of range on hyperlink_map
            Row_depth = len(new_sheet._Worksheet__rows)  # Row depth tells us where to add the new row
            for col in range(old_sheet.ncols):  # For every column we need to add a cell
                if col == 1:  # Column 2 is the hyperlinked column
                    click = list(hyperlink_map.items())[row - 1][1].url_or_path  # define URL
                    new_sheet.write(Row_depth, col, xlwt.Formula('HYPERLINK("{}", "{}")'.format(click, old_sheet.cell(row, 1).value)))
                else:  # If not the hyperlinked column
                    new_sheet.write(Row_depth, col, old_sheet.cell(row, col).value)  # Write cell value

new_file.save("random_directory/output_file.xlsx")
I assume the same Excel files as daedalus. Instead of pandas, I use openpyxl to read the files and create a new one.
import openpyxl

wb1 = openpyxl.load_workbook('tmp.xlsx')
ws1 = wb1['Sheet1']
wb2 = openpyxl.load_workbook('tmp2.xlsx')
ws2 = wb2['Sheet1']

hyperlink_dict = {}

# Go through the first sheet to collect the hyperlinks and keys.
# (Assumes every row in column 2 actually carries a hyperlink.)
for row in range(1, ws1.max_row + 1):
    hyperlink_dict[ws1.cell(row=row, column=1).value] = [
        ws1.cell(row=row, column=2).hyperlink.target,
        ws1.cell(row=row, column=2).value]

# Go through the second sheet to collect the hyperlinks and keys.
for row in range(1, ws2.max_row + 1):
    hyperlink_dict[ws2.cell(row=row, column=1).value] = [
        ws2.cell(row=row, column=2).hyperlink.target,
        ws2.cell(row=row, column=2).value]
Now you have all the data, so you can create a new workbook and save the values from the dict into it via openpyxl.
from openpyxl import Workbook

wb = Workbook(write_only=True)
ws = wb.create_sheet()
for key, (link, text) in hyperlink_dict.items():
    # ws.append() takes an iterable of cell values for one row
    ws.append([key, text, link])
wb.save('new_big_file.xlsx')
https://openpyxl.readthedocs.io/en/stable/optimized.html#write-only-mode
I need to process a number of Excel files with different numbers of tabs and different names. I'm creating a function to load the files with pandas, loop over the sheets, and then return a data frame.
def process_file(file_name):
    # just junk code - will use pandas
    for sheet_name in file_name:
        sheet_x = sheet_name
    return sheet_x

sheet_1, sheet_2 = process_file(excel_file)
Because there is an unknown number of sheets in each file, creating a variable for each one seems too manual. If I wanted to return each sheet, whether there are 2 or 10 of them, is there a way to do that instead of naming each one?
Use an array to store all of your sheets:
def process_file(file_name):
    sheets = []
    # just junk code - will use pandas
    for sheet_name in file_name:
        sheet_x = sheet_name
        sheets.append(sheet_x)
    return sheets

sheets_to_process = []
for excel_file in files:
    sheets_to_process += process_file(excel_file)
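If you are loading the files with pandas anyway, a minimal sketch of an alternative (the file name here is a hypothetical placeholder) is to let read_excel return a dict of DataFrames keyed by sheet name, so no per-sheet variables are needed:

import pandas as pd

def process_file(file_name):
    # sheet_name=None makes read_excel return {sheet_name: DataFrame} for every sheet
    return pd.read_excel(file_name, sheet_name=None)

sheets = process_file("workbook.xlsx")  # hypothetical file name
for name, df in sheets.items():
    print(name, df.shape)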
I have around 100 Excel files in a folder. I need to extract a cell, say cell D6, from sheet-1 of each Excel file and output the values to a new Excel file/sheet. I have followed a few SO questions but have not been able to get the desired output. This is the error I get when I run the program below:
TypeError: cannot concatenate a non-NDFrame object
import os
import pandas as pd
import xlrd
import xlwt
files = os.listdir(path)
files

all_data = pd.DataFrame()
for file in files:
    wb = xlrd.open_workbook(file)
    sheet = wb.sheet_by_index(0)
    df = sheet.cell_value(5, 3)
    all_data.append(df, ignore_index=True)

writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')
all_data.to_excel(writer, 'sheet1')
writer.save()
Your error says that you can only concat a dataframe with another dataframe. When you read the cell with xlrd you don't get a DataFrame object, so either make the single cell a dataframe or store it temporarily and build the dataframe afterwards.
Something like this (untested) should do it.
all_data = []  # list
for file in files:
    # header=None keeps the spreadsheet rows intact, so iloc[5, 3] really is cell D6
    df = pd.read_excel(file, sheet_name='sheet-1', header=None)
    all_data.append(df.iloc[5, 3])

all_data = pd.DataFrame(all_data)  # dataframe
all_data.to_excel('all_data.xlsx')
Or one could use other libraries to do the same, openpyxl for example.
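For instance, a rough sketch of the same idea with openpyxl (assuming the workbooks are .xlsx and the target sheet is the first one; the folder path is a placeholder):

import os
import openpyxl
import pandas as pd

path = "path/to/excel/files"  # hypothetical folder
values = []
for file in os.listdir(path):
    wb = openpyxl.load_workbook(os.path.join(path, file), data_only=True)
    ws = wb.worksheets[0]          # first sheet
    values.append(ws["D6"].value)  # cell D6

pd.DataFrame(values, columns=["D6"]).to_excel("all_data.xlsx", index=False)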
import pandas as pd
from pandas import ExcelWriter
trans=pd.read_csv('HMIS-DICR-2011-12-Manipur-Bishnupur.csv')
df=trans[["April 10-11","May 10-11","June 10-11","July 10-11","August 10-11","September 10-11","October 10-11","November 10-11","December 10-11","January 10-11","February 10-11","March 10-11","April 11-12","May 11-12","June 11-12","July 11-12","August 11-12","September 11-12","October 11-12","November 11-12","December 11-12","January 11-12","February 11-12","March 11-12"]]
writer1 = ExcelWriter('manipur1.xlsx')
df.to_excel(writer1,'Sheet1',index=False)
writer1.save()
This code successfully writes the data to Sheet1, but how can I append the data of another data frame (df), taken from a different file (shown below), to the existing sheet (Sheet1) of the manipur1 Excel file?
For example, my data frame is like:
trans=pd.read_csv('HMIS-DICR-2013-2014-Manipur-Bishnupur.csv')
df=trans[["April 12-13","May 12-13","June 12-13","July 12-13","August 12-13","September 12-13","October 12-13","November 12-13","December 12-13","January 12-13","February 12-13","March 12-13","April 13-14","May 13-14","June 13-14","July 13-14","August 13-14","September 13-14","October 13-14","November 13-14","December 13-14","January 13-14","February 13-14","March 13-14"]]
You can only append new data to an existing Excel file by loading the existing data into pandas, appending the new data, and saving the concatenated data frame again.
To preserve existing sheets that are supposed to remain unchanged, you need to iterate over the entire workbook and handle each sheet. Sheets to be changed and appended are defined in the to_update dictionary.
# get data to be appended
trans = pd.read_csv('HMIS-DICR-2013-2014-Manipur-Bishnupur.csv')
df_append = trans[["April 12-13","May 12-13","June 12-13","July 12-13","August 12-13","September 12-13","October 12-13","November 12-13","December 12-13","January 12-13","February 12-13","March 12-13","April 13-14","May 13-14","June 13-14","July 13-14","August 13-14","September 13-14","October 13-14","November 13-14","December 13-14","January 13-14","February 13-14","March 13-14"]]
# define what sheets to update
to_update = {"Sheet1": df_append}
# load existing data
file_name = 'manipur1.xlsx'
excel_reader = pd.ExcelFile(file_name)
# write and update
excel_writer = pd.ExcelWriter(file_name)
for sheet in excel_reader.sheet_names:
    sheet_df = excel_reader.parse(sheet)
    append_df = to_update.get(sheet)
    if append_df is not None:
        sheet_df = pd.concat([sheet_df, append_df], axis=1)
    sheet_df.to_excel(excel_writer, sheet, index=False)
excel_writer.save()
However, any layout/formatting in your existing Excel file will be lost. You can use openpyxl if you want to retain the formatting, but this is more complicated; see the sketch below.
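As a rough sketch of that openpyxl route (assuming the new columns from df_append should simply be placed to the right of the existing data on Sheet1; headers in row 1 and the column order are assumptions):

from openpyxl import load_workbook

wb = load_workbook('manipur1.xlsx')   # existing formatting is kept
ws = wb['Sheet1']

start_col = ws.max_column + 1         # first empty column
for j, col_name in enumerate(df_append.columns, start=start_col):
    ws.cell(row=1, column=j, value=col_name)          # header
    for i, value in enumerate(df_append[col_name], start=2):
        ws.cell(row=i, column=j, value=value)         # data rows

wb.save('manipur1.xlsx')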
I'd like the code to run 12345 through the loop, write it to a worksheet, then start on 54321 and do the same thing, except write the dataframe to a new worksheet in the same workbook. Below is my code.
workbook = xlsxwriter.Workbook('Renewals.xlsx')
groups = ['12345', '54321']

for x in groups:
    # (Do a bunch of data manipulation and get a pandas df called renewals)
    writer = pd.ExcelWriter('Renewals.xlsx', engine='xlsxwriter')
    worksheet = workbook.add_worksheet(str(x))
    renewals.to_excel(writer, sheet_name=str(x))
When this runs, I am left with a workbook with only 1 worksheet (54321).
Try something like this:
import pandas as pd

# initialize the excel writer
writer = pd.ExcelWriter('MyFile.xlsx', engine='xlsxwriter')

# store your dataframes in a dict, where the key is the sheet name you want
frames = {'sheetName_1': dataframe1, 'sheetName_2': dataframe2,
          'sheetName_3': dataframe3}

# now loop through and put each dataframe on a specific sheet
for sheet, frame in frames.items():  # use .iteritems() on Python 2
    frame.to_excel(writer, sheet_name=sheet)

# critical last step
writer.save()
import pandas as pd
writer = pd.ExcelWriter('Renewals.xlsx', engine='xlsxwriter')
renewals.to_excel(writer, sheet_name=groups[0])
renewals.to_excel(writer, sheet_name=groups[1])
writer.save()
Building on the accepted answer, you can run into situations where the sheet name causes the save to fail because it contains invalid characters or is too long. This could happen, for example, if you are using grouped values for the sheet name. A helper function can address this and save you some pain.
def clean_sheet_name(sheet):
    """Clean sheet name so that it is a valid Excel sheet name.

    Removes the characters []:*?/\\ and limits the name to 30 characters.

    Args:
        sheet (str): Name to use for sheet.

    Returns:
        cleaned_sheet (str): Cleaned sheet name.
    """
    if sheet in (None, ''):
        return sheet
    clean_sheet = sheet.translate({ord(i): None for i in '[]:*?/\\'})
    if len(clean_sheet) > 30:  # Excel allows up to 31; set the value you feel is appropriate
        clean_sheet = clean_sheet[:30]
    return clean_sheet
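For example, a hypothetical messy name would come out like this:

messy = 'Q1/Q2: results for [region]*?'
print(clean_sheet_name(messy))  # -> 'Q1Q2 results for region'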
Then add a call to the helper function before writing to Excel.
for sheet, frame in groups.items():
    # Clean sheet name for length and invalid characters
    sheet = clean_sheet_name(sheet)
    frame.to_excel(writer, sheet_name=sheet, index=False)

writer.save()