Compare sheets across two Excel files that have many sheets - python

I am looking to compare two Excel workbooks that should be identical, to identify where their content differs.
The code below, which I found here, works great, but my workbooks have varying numbers of sheets (some have one sheet, others have 70 sheets across the two workbooks). Is there a way to iterate through all of the dataframes/sheets in a workbook (e.g. over a range of indices) without having to hard-code the index numbers? Thanks!
In block 1
sheet1 = rb1.sheet_by_index(0)
Then in block 2
sheet1 = rb1.sheet_by_index(1)
Then in block 3
sheet1 = rb1.sheet_by_index(2)
```python
from itertools import zip_longest  # named izip_longest in Python 2
import xlrd

rb1 = xlrd.open_workbook('file1.xlsx')
rb2 = xlrd.open_workbook('file2.xlsx')
sheet1 = rb1.sheet_by_index(0)
sheet2 = rb2.sheet_by_index(0)

for rownum in range(max(sheet1.nrows, sheet2.nrows)):
    if rownum < sheet1.nrows:
        row_rb1 = sheet1.row_values(rownum)
        row_rb2 = sheet2.row_values(rownum)
        # compare the two rows cell by cell; the shorter row is padded with None
        for colnum, (c1, c2) in enumerate(zip_longest(row_rb1, row_rb2)):
            if c1 != c2:
                print("Row {} Col {} - {} != {}".format(rownum+1, colnum+1, c1, c2))
    else:
        print("Row {} missing".format(rownum+1))
```

You can use the openpyxl library to get the sheet names of an Excel workbook:
import pandas as pd
from openpyxl import load_workbook
excel = load_workbook(filepath, read_only=True, keep_links=False)
sheet_names = excel.sheetnames
print(sheet_names)  # prints the workbook's sheet names as a list
You can then iterate over the sheet_names variable:
for sheet in sheet_names:
    df = pd.read_excel(filepath, sheet_name=sheet, engine='openpyxl')
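Putting the two pieces together, here is a minimal sketch of comparing every sheet of two workbooks without hard-coding indices. The function names diff_rows and compare_workbooks are hypothetical, and the sketch assumes that sheets are matched by name rather than by position, which the question does not state.

```python
from itertools import zip_longest


def diff_rows(rows1, rows2):
    """Return (row, col, value1, value2) for every differing cell, 1-based."""
    diffs = []
    paired_rows = zip_longest(rows1, rows2, fillvalue=())
    for r, (row1, row2) in enumerate(paired_rows, start=1):
        # a missing row or cell is compared as None
        for c, (v1, v2) in enumerate(zip_longest(row1, row2), start=1):
            if v1 != v2:
                diffs.append((r, c, v1, v2))
    return diffs


def compare_workbooks(path1, path2):
    """Compare every sheet name found in either workbook (hypothetical helper)."""
    # imported here so that diff_rows can be used without openpyxl installed
    from openpyxl import load_workbook

    wb1 = load_workbook(path1, read_only=True, data_only=True)
    wb2 = load_workbook(path2, read_only=True, data_only=True)
    report = {}
    # dict.fromkeys keeps the sheet order and drops duplicate names
    for name in dict.fromkeys(wb1.sheetnames + wb2.sheetnames):
        if name not in wb1.sheetnames or name not in wb2.sheetnames:
            report[name] = "sheet missing from one workbook"
            continue
        rows1 = list(wb1[name].iter_rows(values_only=True))
        rows2 = list(wb2[name].iter_rows(values_only=True))
        report[name] = diff_rows(rows1, rows2)
    return report
```

Calling compare_workbooks('file1.xlsx', 'file2.xlsx') would return a dict keyed by sheet name, where an empty list means no differences were found in that sheet.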

Related

I want to get some data from different sheets of the same file - python

I want to get some data from a single file, but from different sheets. I tried the code below, but it only gives the data of one sheet:
from openpyxl import load_workbook
work = load_workbook(filename=r'the name of the file.xlsx',data_only=True)
for sheet in work.sheetnames[1:len(work.sheetnames)]:
    n = 3
    sheet1 = work[work.sheetnames[n]]
    for val in sheet1.iter_rows(min_row=9, max_row=14, min_col=6, max_col=8, values_only=True):
        print(str(sheet) + " " + str(val))
    n += 1
Assume we have an .xlsx file with 6 sheets named sheet0, sheet1, sheet2, ..., sheet5.
work.sheetnames gives you a list of all sheet names: ['sheet0', 'sheet1', 'sheet2', 'sheet3', ..., 'sheet5']
When you use:
for sheet in work.sheetnames[1:len(work.sheetnames)]:
you iterate over the names from sheet1 to sheet5.
After that you have:
n = 3
sheet1 = work[work.sheetnames[n]]
work.sheetnames[3] returns the string 'sheet3', and because n is reset to 3 at the start of every iteration (the n += 1 at the end has no lasting effect), sheet1 is always work['sheet3']. I think this is why your result is not what you expect.
The code below should fix the problem.
from openpyxl import load_workbook
work = load_workbook(filename=r'the name of the file.xlsx', data_only=True)
# skip the first sheet and use the loop variable to pick each sheet
for sheet in work.sheetnames[1:]:
    sheet1 = work[sheet]
    for val in sheet1.iter_rows(min_row=9, max_row=14, min_col=6, max_col=8, values_only=True):
        print(str(sheet) + " " + str(val))
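The difference between the two loops can be reproduced without any workbook. The following is a small sketch that uses a plain list in place of work.sheetnames; the two function names are hypothetical and only mimic which sheet each version of the loop would open.

```python
sheetnames = ["sheet0", "sheet1", "sheet2", "sheet3", "sheet4", "sheet5"]


def visited_with_fixed_index(names):
    """Mimic the original loop: n is reset to 3 on every pass."""
    visited = []
    for _ in names[1:]:
        n = 3
        visited.append(names[n])
        n += 1  # no lasting effect: n is reset at the top of the loop
    return visited


def visited_with_loop_variable(names):
    """Mimic the corrected loop: the loop variable selects the sheet."""
    visited = []
    for name in names[1:]:
        visited.append(name)
    return visited


print(visited_with_fixed_index(sheetnames))
print(visited_with_loop_variable(sheetnames))
```

The first call lists 'sheet3' five times, while the second lists sheet1 through sheet5 once each.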

How do I loop through each source file and copy a specific column into a new workbook with each new "paste" shifting to the adjacent column?

I have 3 Excel files with a column of data in cells A1 to A10 (the "Source Cells") in each workbook (on sheet 1 in each workbook). I would like to copy the data from the Source Cells into a new workbook, but the data must shift into a new column each time.
For example:
the Source Cells in File 1 must be copied to cells A1 to A10 in the new workbook;
the Source Cells in File 2 must be copied to cells B1 to B10 in the new workbook; and
the Source Cells in File 3 must be copied to cells C1 to C10 in the new workbook.
I'm struggling to figure out the best way to adjust "j" in my code on each iteration. I'm also not sure what the cleanest way is to run each function for the different source files.
All suggestions on how to make this code cleaner will also be appreciated because I admit it's so messy at the moment!
Thanks in advance!
import openpyxl as xl
filename_1 = "C:\\workspace\\scripts\\file1.xlsx"
filename_2 = "C:\\workspace\\scripts\\file2.xlsx"
filename_3 = "C:\\workspace\\scripts\\file3.xlsx"
destination_filename = "C:\\workspace\\scripts\\new_file.xlsx"
num_rows = 10
num_columns = 1
def open_source_workbook(path):
    '''Open the workbook and worksheet in the source Excel file'''
    workbook = xl.load_workbook(path)
    worksheet = workbook.worksheets[0]
    return worksheet
def open_destination_workbook(path):
    '''Open the destination workbook I want to copy the data to.'''
    new_workbook = xl.load_workbook(path)
    return new_workbook
def open_destination_worksheet(path):
    '''Open the worksheet of the destination workbook I want to copy the data to.'''
    new_worksheet = new_workbook.active
    return new_worksheet
def copy_to_new_file(worksheet, new_worksheet):
    for i in range (1, num_rows + 1):
        for j in range (1, num_columns + 1):
            c = worksheet.cell(row = i, column = j)
            new_worksheet.cell(row = i, column = j).value = c.value
worksheet = open_source_workbook(filename_1)
new_workbook = open_destination_workbook(destination_filename)
new_worksheet = open_destination_worksheet(new_workbook)
copy_to_new_file(worksheet, new_worksheet)
new_workbook.save(str(destination_filename))
Question: Loop files, copy a specific column, with each new “paste” shifting to the adjacent column?
This approach first collects the column cell values from all files.
It then rearranges them so that they can be passed to openpyxl's Worksheet.append(...) method.
Therefore, no knowledge of the target column is needed.
Reference:
class collections.OrderedDict([items])
Ordered dictionaries are just like regular dictionaries but have some extra capabilities relating to ordering operations.
openpyxl.utils.cell.coordinate_to_tuple(coordinate)
Convert an Excel style coordinate to (row, column) tuple
iter_rows(min_row=None, max_row=None, min_col=None, max_col=None, values_only=False)
Produces cells from the worksheet, by row. Specify the iteration range using indices of rows and columns.
map(function, iterable, ...)
Return an iterator that applies function to every item of iterable, yielding the results.
zip(*iterables)
Make an iterator that aggregates elements from each of the iterables.
Used imports
import openpyxl as opxl
from collections import OrderedDict
from openpyxl.utils.cell import range_to_tuple
Define the files in an OrderedDict to retain the file <=> column order
file = OrderedDict.fromkeys(('file1.xlsx', 'file2.xlsx', 'file3.xlsx'))
Define the range as index values.
Convert the Excel A1 notation into index values
# range_to_tuple returns (sheet_title, (min_col, min_row, max_col, max_row))
min_col, min_row, max_col, max_row = range_to_tuple('DUMMY!A1:A10')[1]
Loop over the defined files,
load every Workbook and get a reference to the default Worksheet,
get the cell values from the defined range:
min_col=1, max_col=1, min_row=1, max_row=10
for fname in file.keys():
    wb = opxl.load_workbook(fname)
    ws = wb.active  # the active sheet is the default Worksheet
    file[fname] = ws.iter_rows(min_row=min_row,
                               max_row=max_row,
                               min_col=min_col,
                               max_col=max_col,
                               values_only=True)
Define a new Workbook and get a reference to the default Worksheet
wb2 = opxl.Workbook()
ws2 = wb2.active
Zip the values, row by row, from all files.
Map the zipped tuples using a lambda to flatten each one into a single tuple of row values.
Append the values to the new Worksheet
for row_value in map(lambda r: tuple(v for c in r for v in c),
                     zip(*(file[k] for k in file))):
    ws2.append(row_value)
Save the new Workbook
# wb2.save(...)
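The same idea can be sketched more compactly, assuming only one source column per file. The names columns_to_rows and copy_first_columns are hypothetical, and the sketch writes to a brand-new workbook rather than loading an existing destination file as the question's code does.

```python
from itertools import zip_longest


def columns_to_rows(columns):
    """Turn [[a1, a2], [b1, b2]] into [(a1, b1), (a2, b2)].

    Shorter columns are padded with None.
    """
    return list(zip_longest(*columns))


def copy_first_columns(source_paths, destination_path, num_rows=10):
    """Copy A1..A<num_rows> of each source file into adjacent columns."""
    # imported here so that columns_to_rows works without openpyxl installed
    from openpyxl import Workbook, load_workbook

    columns = []
    for path in source_paths:
        ws = load_workbook(path, data_only=True).worksheets[0]
        rows = ws.iter_rows(min_row=1, max_row=num_rows,
                            min_col=1, max_col=1, values_only=True)
        columns.append([row[0] for row in rows])

    out = Workbook()
    for row in columns_to_rows(columns):
        # the n-th value of each row lands in the n-th column
        out.active.append(row)
    out.save(destination_path)
```

Because each appended row holds one value per source file, the column shift happens implicitly and no "j" counter is needed; the order of source_paths decides which file ends up in column A, B or C.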

How to merge two Excel files that have different table headings into one master file via python?

I have different Excel sheets that have almost identical table headings, but the order of the headings differs. I would like to put them in the same order and merge those files via Python.
Use openpyxl; it would be something like this:
from openpyxl import Workbook, load_workbook
classeur1 = load_workbook('test1.xlsx')
classeur2 = load_workbook('test2.xlsx')
feuille1 = classeur1.active
feuille2 = classeur2.active
workbook_result = Workbook()
f_result = workbook_result.active
# copy every cell of the first sheet into the result sheet
for row in feuille1.iter_rows():
    for cell in row:
        f_result.cell(row=cell.row, column=cell.column).value = cell.value
# combine the second sheet with what is already in the result sheet
for row in feuille2.iter_rows():
    for cell in row:
        if f_result.cell(row=cell.row, column=cell.column).value:
            f_result.cell(row=cell.row, column=cell.column).value += cell.value
        else:
            # empty target cell: copy the value from the second sheet
            f_result.cell(row=cell.row, column=cell.column).value = cell.value
workbook_result.save('merged.xlsx')
The details may differ considerably; it depends on your data.
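For the header-ordering part of the question, one possible approach is to reorder each row by header name before merging. This is only a sketch under assumptions: align_rows and merge_tables are hypothetical helpers, each table is taken to be a header list plus complete rows, and columns absent from a table are filled with None.

```python
def align_rows(header, rows, master_header):
    """Reorder each row so its values follow master_header.

    Columns that the table does not have are filled with None.
    """
    position = {name: i for i, name in enumerate(header)}
    aligned = []
    for row in rows:
        aligned.append([row[position[name]] if name in position else None
                        for name in master_header])
    return aligned


def merge_tables(tables, master_header):
    """tables is a list of (header, rows) pairs; returns header plus all rows."""
    merged = [list(master_header)]
    for header, rows in tables:
        merged.extend(align_rows(header, rows, master_header))
    return merged


table1 = (["id", "name"], [[1, "a"]])
table2 = (["name", "id"], [["b", 2]])
print(merge_tables([table1, table2], ["id", "name"]))
```

With openpyxl, the rows could come from iter_rows(values_only=True), taking the first row of each sheet as its header, and the merged rows could be written out one by one with the worksheet's append method.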

Compare 2 columns in different excel workbooks: Python

I am trying to compare 2 Excel columns in different workbooks using openpyxl in Python. So far what I've got is:
import os
from openpyxl import load_workbook
# Load the workbooks
wkb1 = load_workbook(filename = os.path.join(srcdir, "wbk1.xlsx"))
wkb2 = load_workbook(filename = os.path.join(srcdir, "wbk2.xlsx"))
# Find the last row of the Excel data
ws1 = wkb1.active
wkb1_LastRow = ws1.max_row
ws2 = wkb2.active
wkb2_LastRow = ws2.max_row
for xrow in range (1,(wkb1_LastRow+1)):
    for yrow in range (1,(wkb2_LastRow+1)):
        print (ws1.cell(row=xrow, column=1).value, ws2.cell(row=yrow, column=1).value )
        if ws1.cell(row=xrow, column=1).value == ws2.cell(row=yrow, column=1).value:
            print('HIT')
The problem is that the if statement always fails, even though the 2 columns contain the same values:
...
3145728 3145728,
3145728 3145729,
3145728 3145730,
3145728 3145731,
...
Any ideas?
FWIW, using nested loops is not the way to do this. It is much simpler to use zip.
The following should work:
for src, target in zip(ws1['A'], ws2['A']):
    if src.value == target.value:
        print("Hit")
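Note that zip compares the two columns row by row, whereas the nested loops in the question compare every value with every other value. A rough sketch of both readings on plain lists of cell values follows; the function names positional_hits and shared_values are hypothetical.

```python
def positional_hits(col_a, col_b):
    """1-based row numbers where both columns hold equal values.

    zip stops at the end of the shorter column.
    """
    return [i for i, (a, b) in enumerate(zip(col_a, col_b), start=1) if a == b]


def shared_values(col_a, col_b):
    """Values of col_a that occur anywhere in col_b, in col_a order."""
    lookup = set(col_b)  # set membership avoids the nested loop
    return [a for a in col_a if a in lookup]


col_a = [3145728, 3145729, 3145730]
col_b = [3145728, 3145731, 3145729]
print(positional_hits(col_a, col_b))
print(shared_values(col_a, col_b))
```

The value lists could be built with something like [cell.value for cell in ws1['A']]. Neither helper treats the number 3145728 and the text '3145728' as equal, so if the two workbooks store the values with different types, which the question does not confirm, the values would need to be converted to a common type first.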

Python splitting an Excel workbook

I am looking for a way to split an Excel workbook that contains multiple tabs/sheets into multiple workbooks, one for each tab/sheet in the original workbook.
What I have worked out so far:
from xlrd import open_workbook
from xlwt import Workbook
rb = open_workbook('c:\\original file.xls',formatting_info=True)
for a in range(5): #for example there're only 5 tabs/sheets
    rs = rb.sheet_by_index(a)
    new_book = Workbook()
    new_sheet = new_book.add_sheet('Sheet 1')
    for row in range(rs.nrows):
        for col in range(rs.ncols):
            new_sheet.write(row, col, rs.cell(row, col).value)
    new_book.save("c:\\" + str(a) + ".xls")
This does nothing more than read the sheets one by one and save them one by one. Is there a better or more direct way?
