Combine multiple excel workbooks into one with multiple sheets - python

I have over 50 workbooks that I want to combine into one workbook as 50 sheets, with formatting, coloring, filling, etc. still intact.
This is what I tried:
import pandas as pd
import os
from openpyxl import load_workbook
from openpyxl import Workbook
path = "mypath"
directory = os.listdir(f'{path}')
files = [f for f in directory if f[-4:] == 'xlsx']
combined = Workbook()
ws = combined.active
for item in files:
    wb = load_workbook(filename = f'{path}/{item}')
    sheet = wb.sheetnames
    data = pd.read_excel(f'{path}/{item}',sheet_name=f'{sheet[0]}')
    data.to_excel(f'{path}/combined.xlsx',sheet_name=f'{sheet[0]}',header=None,index=None)
There were a couple issues with the result:
1. It overwrote the sheet each iteration, so the final workbook had 1 sheet, with information of only the last workbook.
2. The sheet did not retain the formatting
I'm essentially trying to copy each sheet into one workbook as I would using Excel's copy sheet command without having to do it 50 times.
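Both symptoms have a plausible explanation: a DataFrame only carries cell values, so formatting is gone once a sheet has passed through pandas, and calling to_excel with a file path rewrites the whole file on every iteration. Below is a minimal sketch of one alternative, assuming openpyxl alone is acceptable: copy each cell's value and style objects into a new sheet of a single workbook and save once at the end. The helper names copy_sheet and combine_workbooks are hypothetical, only the first sheet of each file is copied, and images, charts, conditional formatting and data validation are not carried over.

```python
import os
from copy import copy

from openpyxl import Workbook, load_workbook


def copy_sheet(src_ws, dst_ws):
    """Copy values, cell styles, column widths, row heights and merged ranges."""
    for row in src_ws.iter_rows():
        for cell in row:
            new_cell = dst_ws.cell(row=cell.row, column=cell.column)
            new_cell.value = cell.value
            if cell.has_style:
                # Style objects belong to the source workbook, so copy them
                new_cell.font = copy(cell.font)
                new_cell.fill = copy(cell.fill)
                new_cell.border = copy(cell.border)
                new_cell.alignment = copy(cell.alignment)
                new_cell.protection = copy(cell.protection)
                new_cell.number_format = cell.number_format
    for key, dim in src_ws.column_dimensions.items():
        dst_ws.column_dimensions[key].width = dim.width
    for idx, dim in src_ws.row_dimensions.items():
        dst_ws.row_dimensions[idx].height = dim.height
    for merged_range in src_ws.merged_cells.ranges:
        dst_ws.merge_cells(merged_range.coord)


def combine_workbooks(folder, out_name="combined.xlsx"):
    """Copy the first sheet of every .xlsx file in `folder` into one workbook."""
    combined = Workbook()
    combined.remove(combined.active)  # drop the default empty sheet
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".xlsx") or name == out_name:
            continue
        src_ws = load_workbook(os.path.join(folder, name)).worksheets[0]
        # Sheet titles are limited to 31 characters
        title = os.path.splitext(name)[0][:31]
        copy_sheet(src_ws, combined.create_sheet(title=title))
    out_path = os.path.join(folder, out_name)
    combined.save(out_path)  # save once, after all sheets are in place
    return out_path
```

Calling combine_workbooks("mypath") would write mypath/combined.xlsx with one sheet per source file, named after the file. File names containing characters that Excel does not allow in sheet titles would need extra handling.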

Related

Copy Data from different workbook into Master Workbook with Python

I have to copy data from different workbooks and paste it into a master workbook. All the workbooks are located in one folder: C:\Users\f65651\data transfer. The copied data should be merged and then written over the existing cells of the master workbook. Later, data from updated workbooks should likewise overwrite what is in the master workbook.
After some help, I have been able to combine all the Excel workbooks:
import openpyxl as xl
import os
path1 = r'C:\Users\f65651\Rresult.xlsx'  # Master workbook
directory = r'C:\Users\f65651\data transfer'  # folder holding the source workbooks
wb1 = xl.load_workbook(filename=path1)
ws1 = wb1.worksheets[0]
# iterating over the workbooks
for filename in os.listdir(directory):
    if filename.endswith(".xlsx"):
        g = os.path.join(directory, filename)
        f = xl.load_workbook(filename=g)
        f1 = f.worksheets[0]
        print(filename, f1)
        for row in f1:
            values = [cell.value for cell in row]
            ws1.append(values)
wb1.save(path1)
print('Process finished!')
However, with the code above, the data is appended below the master workbook's existing table instead of being written directly into its cells.
I have tried fixing this issue but I don't know how. I feel I am not copying the workbooks into the master workbook correctly. I also don't want to lose the formatting in the master sheet. Please help!
For a better understanding of the problem, I have attached a snippet of what I am trying to achieve. Data 1 and 2 are examples of the workbooks, and the Result file is the master sheet.
https://i.stack.imgur.com/0G4lM.png
from openpyxl import Workbook, load_workbook
import os
directory = "workbooks"
master = Workbook()
master_sheet = master.active
master_sheet.title = "master_sheet"
for filename in os.listdir(directory):
    if filename.endswith(".xlsx"):
        file_path = os.path.join(directory, filename)
        sheet = load_workbook(file_path).active
        # Copy the first two (header) rows cell by cell, then append every row from row 3 onward
        for index, row in enumerate(sheet.iter_rows()):
            if index <= 1:
                for cell in row:
                    master_sheet[cell.coordinate].value = cell.value
            else:
                # append() accepts a dict keyed by column letter
                row_dict = {cell.column_letter: cell.value for cell in row}
                master_sheet.append(row_dict)
master.save("sheet3.xlsx")
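If the existing master file and its formatting have to be kept, one option is to load the master and write values into explicit cell positions instead of calling append(), since assigning to a cell's value leaves its style alone. The following is only a sketch under assumptions: the helper names write_rows_at and fill_master are hypothetical, and the data is assumed to start at row 3 in both the master and the source sheets (taken from the "starting from row 3" comment above).

```python
from openpyxl import Workbook, load_workbook


def write_rows_at(ws, rows, start_row, start_col=1):
    """Overwrite cell values in place; existing cell styles are not touched."""
    next_row = start_row
    for row in rows:
        for offset, value in enumerate(row):
            ws.cell(row=next_row, column=start_col + offset).value = value
        next_row += 1
    return next_row  # first row after the block that was written


def fill_master(master_path, source_paths, first_data_row=3):
    """Write the data rows of each source workbook into the master, top down."""
    master_wb = load_workbook(master_path)
    master_ws = master_wb.worksheets[0]
    next_row = first_data_row
    for path in source_paths:
        src_ws = load_workbook(path).worksheets[0]
        rows = src_ws.iter_rows(min_row=first_data_row, values_only=True)
        next_row = write_rows_at(master_ws, rows, next_row)
    master_wb.save(master_path)
```

Because every run starts writing at first_data_row again, re-running fill_master after the source workbooks change overwrites the earlier values rather than stacking new rows underneath them. Rows left over from a longer previous run are not cleared in this sketch.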

How to create multiple excel sheets through Python?

I have an optimization problem that runs in a for loop. I want the results of each new iteration to be saved in a different tab in the same workbook.
This is what I'm doing. Instead of giving me multiple tabs in the same workbook, I'm getting multiple workbooks.
from openpyxl import Workbook
wb1 = Workbook()
for i in range(n):
    ws = wb1.active()
    ws.title = str(i)
    #code on formatting sheet, optimization problem
    wb1.save('outfile'+str(i)+'.xlsx')
Every iteration you are grabbing the same worksheet - ws = wb1.active() - and then simply saving your results to a different workbook.
You simply need to create a new sheet on each iteration. Something like this:
from openpyxl import Workbook
wb1 = Workbook()
for i in range(n):
    ws = wb1.create_sheet("run " + str(i))
    #code on formatting sheet, optimization problem
wb1.save('outfile.xlsx')
Notice that the save call sits outside the loop, so the file is written once, after all worksheets have been formatted. It is not necessary to save on each iteration, and saving can take time, especially as more tabs are added.
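For a version that can be run as-is, here is a small sketch of the same idea. The function name build_workbook and the results argument are hypothetical stand-ins for whatever the optimization loop produces. It also removes the empty default sheet that Workbook() creates.

```python
from openpyxl import Workbook


def build_workbook(results):
    """Create one worksheet per entry in `results` (a list of row values)."""
    wb = Workbook()
    wb.remove(wb.active)  # drop the default empty sheet
    for i, values in enumerate(results):
        ws = wb.create_sheet("run " + str(i))
        ws.append(values)  # stand-in for the real formatting/optimization code
    return wb
```

Usage would look like wb = build_workbook([[1, 2], [3, 4], [5, 6]]) followed by a single wb.save('outfile.xlsx').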
This code creates an Excel workbook containing one worksheet per line of an input text file. Here I have a text file named 'sample.txt' containing 3 strings, so the code creates 3 worksheets in a workbook named 'reformatted.data.xlsx'.
I have also removed the default worksheet that is created automatically when the workbook object is created.
from openpyxl import Workbook
wb1 = Workbook()
# Raw string so the backslashes in the Windows path are not read as escapes
with open(r'C:\Desktop\Mytestcases\sample.txt') as f:
    lines = f.readlines()
for i in range(len(lines)):
    ws = wb1.create_sheet("worksheet" + str(i))
    ws.cell(row=1, column=1).value = lines[i]
# Remove the default sheet that Workbook() creates
wb1.remove(wb1['Sheet'])
# openpyxl writes the .xlsx format, so the file gets an .xlsx extension
wb1.save('reformatted.data.xlsx')

Write CSV content to Excel produces empty sheets

Writing DataFrame to excel file leaves sheets with zero data.
I am creating a Robotics "Scouting application". It receives multiple .csv files throughout the course of two days. The csv files are named with a four-digit team number, a hyphen and then a match number, for example "2073-18.csv". Multiple files will arrive for each team. I need one sheet for each team, with the content of each csv file for that team on the same sheet. Creating the sheets works; writing the data to these sheets doesn't.
import os
import glob
import csv
from xlsxwriter.workbook import Workbook
import pandas as pd
import numpy as np
#from sqlalchemy import create_engine
from openpyxl import load_workbook
os.chdir ("/EagleScout")
path = '.'
extension = 'csv'
engine = 'xlsxwriter'
files_in_dir = [ f for f in glob.glob('*.csv')]
workbook = Workbook('Tournament.xlsx')
with pd.ExcelWriter('Tournament.xlsx') as writer:
    for csvfile in files_in_dir:
        df = pd.read_csv(csvfile)
        fName, fExt = (os.path.splitext(csvfile))
        sName = fName.split('-')
        worksheet = workbook.get_worksheet_by_name(sName [0])
        if worksheet is None:
            worksheet = workbook.add_worksheet(sName [0]) #worksheet with csv file name
        df.to_excel(writer, sheet_name = (sName[0]))
    writer.save()
workbook.close()
What I need is one workbook with one sheet for each team, up to 70 teams. Each sheet will have multiple rows, one for each csv file that arrived for that team. The question is, how do I get Pandas, or other libraries, to write the content of each csv file to its appropriate sheet in the workbook?
OK, with the input from @ivan_pozdeev, I finally got past my issues.
Remember, my original goal was a script that could be run on a regular basis to generate a spreadsheet with multiple worksheets. Each worksheet would contain all the data from the .csv files for every match that had been played, grouped by team number.
I have also added a single spreadsheet that contains the raw data.
Here is what I came up with:
import os
import glob
import csv
import xlsxwriter
from xlsxwriter.workbook import Workbook
import pandas as pd
import numpy as np
#from sqlalchemy import create_engine
#import openpyxl
#from openpyxl import load_workbook
os.chdir ("/EagleScout")
path = '.'
extension = 'csv'
# Remove the combined .csv file from previous runs
# This provides clean data without corruption from earlier runs
if os.path.exists('./Spreadsheets/combined.csv'):
    os.remove ('./Spreadsheets/combined.csv')
# Remove previous Excel spreadsheet
if os.path.exists('./Spreadsheets/Tournament.xlsx'):
    os.remove ('./Spreadsheets/Tournament.xlsx')
# Remove previous combined Excel spreadsheet
if os.path.exists('./Spreadsheets/Combined.xlsx'):
    os.remove ('./Spreadsheets/Combined.xlsx')
# Read in and merge all .CSV file names
files_in_dir = [ f for f in glob.glob('*.csv')]
# Create a single combined .csv file with all data
# from all matches completed so far.
d1 = pd.read_csv('Header.txt')
d1.to_csv('./Spreadsheets/combined.csv', header = True, index = False)
for filenames in files_in_dir:
    df = pd.read_csv(filenames)
    fName, fExt = (os.path.splitext(filenames))
    sName = fName.split('-')
    N=(sName[1])
    df.insert(0,N,N,True)
    df.to_csv('./Spreadsheets/combined.csv', index_label = (sName[0]), mode = 'a')
# Combine all csv files into one master Raw Excel Data file
# and add column headers as labels
with pd.ExcelWriter('./Spreadsheets/Combined.xlsx') as writer:
    dt = pd.read_csv('./Spreadsheets/combined.csv')
    dt.to_excel(writer, sheet_name = 'All data')
    # no explicit writer.save(): the with block saves the file on exit
# Parse through all .CSV files and append content to appropriate team worksheet.
with pd.ExcelWriter('./Spreadsheets/Tournament.xlsx') as writer:
    df2 = pd.read_excel('./Spreadsheets/Combined.xlsx')
    group = df2.groupby('Team')
    for Team, Team_df in group:
        Team_df.to_excel(writer, sheet_name = str(Team))
I am certain there is a cleaner way to write this code; I'm still new at this, but for now it does what I expect.
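One possibly cleaner route, sketched here under assumptions rather than as the definitive way: group the csv files by the team number in the file name, concatenate each team's frames in memory, and write every team through a single pd.ExcelWriter. The function name build_tournament_workbook and the added "Match" column are hypothetical, and each csv is assumed to carry its own header row with the same columns (and no existing "Match" column).

```python
import glob
import os
from collections import defaultdict

import pandas as pd


def build_tournament_workbook(csv_dir, out_path):
    """Write one sheet per team, stacking that team's match files as rows."""
    frames_by_team = defaultdict(list)
    for csv_path in sorted(glob.glob(os.path.join(csv_dir, "*.csv"))):
        stem = os.path.splitext(os.path.basename(csv_path))[0]
        # "2073-18" splits into team "2073" and match "18"
        team, _, match = stem.partition("-")
        df = pd.read_csv(csv_path)
        df.insert(0, "Match", match)  # remember which match each row came from
        frames_by_team[team].append(df)

    # A single writer owns the file; it is saved when the with block exits
    with pd.ExcelWriter(out_path, engine="openpyxl") as writer:
        for team in sorted(frames_by_team):
            team_df = pd.concat(frames_by_team[team], ignore_index=True)
            team_df.to_excel(writer, sheet_name=team, index=False)
    return sorted(frames_by_team)
```

The key difference from the original attempt is that only one object writes to the output file. Opening the same path with both an xlsxwriter Workbook and a pandas writer lets whichever one closes last overwrite the other's output, which would explain the empty sheets.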

How to delete duplicate columns in multiple Excel sheets of one workbook?

I have multiple sheets in one Excel workbook with duplicated columns in each sheet. I need to delete the duplicates and to leave the original columns only.
I know how to drop duplicates within a sheet.
df_sheet_map['> Acute Hospital Bed SLM']
result2=df_sheet_map['> Acute Hospital Bed SLM'].T.drop_duplicates().T
import os
import pandas as pd
dfList = []
path = 'J:/TestDup'
newpath = 'J:/TestDup/Test2'
for fn in os.listdir(path):
    file = os.path.join(path, fn)
    if os.path.isfile(file):
        # Import the excel file and call it xlsx_file
        xlsx_file = pd.ExcelFile(file)
        # View the excel files sheet names
        xlsx_file.sheet_names
        # Load the xlsx files Data sheet as a dataframe
        df = xlsx_file.parse('Sheet1',header= None)
        df_NoHeader = df[2:]
        data = df_NoHeader
        # Save individual dataframe
        data.to_excel(os.path.join(newpath, fn))
        dfList.append(data)
appended_data = pd.concat(dfList)
appended_data.to_excel(os.path.join(newpath, 'master_data.xlsx'))
The above code works. However, I need to traverse all sheets, and instead of deleting the first two rows (as it does now) I need it to delete the duplicate columns.
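A pandas-only sketch of the "all sheets" part, offered as one possible approach: read_excel with sheet_name=None returns a dict of every sheet, and a column counts as a duplicate here when its contents repeat an earlier column, regardless of its header. The function names drop_duplicate_columns and dedupe_workbook are hypothetical, and cell formatting is not preserved because the data passes through DataFrames.

```python
import pandas as pd


def drop_duplicate_columns(df):
    """Keep the first occurrence of each column whose contents repeat."""
    keep = ~df.T.duplicated().to_numpy()
    return df.loc[:, keep]


def dedupe_workbook(src_path, dst_path):
    """Drop duplicate columns from every sheet and write a new workbook."""
    sheets = pd.read_excel(src_path, sheet_name=None)  # dict: sheet name -> DataFrame
    with pd.ExcelWriter(dst_path, engine="openpyxl") as writer:
        for name, df in sheets.items():
            drop_duplicate_columns(df).to_excel(writer, sheet_name=name, index=False)
```

Unlike the .T.drop_duplicates().T form, selecting columns with a boolean mask keeps the original dtypes of the remaining columns.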
#Transpose all sheets in a workbook. then delete duplicates. then Transpose back to original file and save all sheets
#Transpose all sheets in the workbook file
import pyexcel
import pyexcel_xlsx as pe
from pyexcel_xlsx import get_data
book = pyexcel.get_book(file_name="H:/SLM_Final/SLM Indicator template Main to clean.xlsx")
for sheet in book:
    sheet.transpose()
    pass
book.save_as("H:/SLM_Final/SLM Indicator template Main to clean.xlsx")
#run excel VB from python
import win32com.client as win32
import time
xl = win32.Dispatch('Excel.Application')
xl.Visible = 0
ss = xl.Workbooks.Open('H:/SLM_Final/DeleteDup.xlsm')
xl.Run("deleteDuplicate")
time.sleep(30)
xl.Quit()
time.sleep(30)
#VB syntax to add on excel workbook
'''Sub deleteDuplicate()
    Dim ws As Worksheet
    Dim wkbk1 As Workbook
    Dim w As Long
    Dim lRow As Long
    Dim iCntr As Long
    Set wkbk1 = Workbooks.Open("H:/SLM_Final/SLM Indicator template Main to clean.xlsx")
    'Set wkbk1 = ThisWorkbook
    wkbk1.Activate
    With wkbk1
        For w = 1 To .Worksheets.Count
            With Worksheets(w)
                .UsedRange.RemoveDuplicates Columns:=Array(3, 4), Header:=xlYes
            End With
        Next w
    End With
    wkbk1.Save
    wkbk1.Close
End Sub'''
#
#Transpose files back to the original shape
import pyexcel
import pyexcel_xlsx as pe
from pyexcel_xlsx import get_data
book = pyexcel.get_book(file_name="H:/SLM_Final/SLM Indicator template Main to clean.xlsx")
for sheet in book:
    sheet.transpose()
    #sheet.delete_duplicates(keep=False, inplace=False)
    pass
book.save_as("H:/SLM_Final/SLM Indicator template Main to clean.xlsx")
I hope this will help.

Converting multiple xls files to xlsx- issues with scaling up from single file

We have a few thousand xls files, with dozens of sheets in each file. We are working on a larger project to combine the files and sheets, but first need to convert them to xlsx.
The following code works fine on a single file:
import xlrd
from openpyxl.workbook import Workbook as openpyxlWorkbook
xlsBook = xlrd.open_workbook("C://path")
workbook = openpyxlWorkbook()
for i in range(0, xlsBook.nsheets):
    xlsSheet = xlsBook.sheet_by_index(i)
    sheet = workbook.active if i == 0 else workbook.create_sheet()
    sheet.title = xlsSheet.name
    for row in range(0, xlsSheet.nrows):
        for col in range(0, xlsSheet.ncols):
            sheet.cell(row=row+1, column=col+1).value = xlsSheet.cell_value(row, col)
workbook.save("c://path/workbook.xlsx")
This works perfectly.
When attempting to loop through all files, we use:
import xlrd
from openpyxl.workbook import Workbook as openpyxlWorkbook
import glob
import pandas as pd
from pandas import ExcelWriter
import os
path ="C://path"
path2 = "C://path2"
allFiles = glob.glob(path + "/*.xls")
for file_ in allFiles:
    xlsBook = xlrd.open_workbook(file_)
    workbook = openpyxlWorkbook()
    for i in range(0, xlsBook.nsheets):
        xlsSheet = xlsBook.sheet_by_index(i)
        sheet = workbook.active if i == 0 else workbook.create_sheet()
        sheet.title = xlsSheet.name
        for row in range(0, xlsSheet.nrows):
            for col in range(0, xlsSheet.ncols):
                sheet.cell(row=row+1, column=col+1).value = xlsSheet.cell_value(row, col)
    ##workbook.save(os.path.join(path2,file_))
    ##workbook.to_excel(os.path.join(path2,file_))
workbook.save("C://path/workbook.xlsx")
For the first two commented out save methods, workbook.save seems to do absolutely nothing, and to_excel tells me workbook does not have a property called to_excel...is that because I didn't call pandas in the loop?
The final workbook.save was a test- I assumed it would save the final iteration of the loop correctly, since it worked in the script with just one file.
Instead, it creates the file, with all of the worksheets correctly named, but no data in any of the worksheets.
Any idea what I am missing? To be clear, I am looking to have each file named with its original filename at the end of the loop, and a valid xlsx extension.
I'd try this way instead. Simpler code and it worked when I tested it.
import glob
import os
import pandas as pd
def converter(filename):
    xl = pd.ExcelFile(filename) # reads file in
    sheet_names = xl.sheet_names # gets the sheet names of the file
    sheets_dict = {} # dictionary with sheet_names as keys and data as values
    for sheet in sheet_names:
        sheets_dict[sheet] = xl.parse(sheet)
    # keep only the file name from the path and swap the extension for .xlsx
    stem = os.path.splitext(os.path.basename(filename))[0]
    # the with block saves the file on exit, so no explicit writer.save() is needed
    with pd.ExcelWriter(os.path.join(r'C:\Users\you\Desktop', stem + '.xlsx')) as writer:
        for sheet_name, data in sheets_dict.items():
            data.to_excel(writer, sheet_name=sheet_name, index = False)
files = glob.glob(os.path.join(r'C:\Users\you\Desktop', '*.xls'))
for file in files:
    converter(file)
Edit: I'm not too familiar with openpyxl, but I don't believe it has a .to_excel method. I think you were creating an openpyxl workbook but then trying to save it using a pandas method.
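On the file naming in the question: glob returns full paths, and os.path.join drops everything before an absolute component, so os.path.join(path2, file_) most likely resolves back to the original .xls path rather than to a file inside path2. That is one likely reason the first commented-out save appeared to do nothing. A small sketch of deriving the target name instead, with hypothetical helper names xlsx_target and save_converted:

```python
import os

from openpyxl import Workbook


def xlsx_target(src_path, out_dir):
    """Return out_dir/<original file name>.xlsx for a source .xls path."""
    stem = os.path.splitext(os.path.basename(src_path))[0]
    return os.path.join(out_dir, stem + ".xlsx")


def save_converted(workbook, src_path, out_dir):
    """Save an openpyxl workbook under its original name with an .xlsx extension."""
    os.makedirs(out_dir, exist_ok=True)
    target = xlsx_target(src_path, out_dir)
    workbook.save(target)  # openpyxl workbooks are written with save(), not to_excel()
    return target
```

Inside the conversion loop this would be called once per input file, after that file's sheets have been filled, so each workbook is saved before the next one replaces it.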
