Compare 2 columns in different Excel workbooks: Python

I am trying to compare 2 Excel columns in different workbooks using openpyxl in Python. So far what I've got is:
import os
from openpyxl import load_workbook
#Load the workbooks
wkb1 = load_workbook(filename = os.path.join(srcdir, "wbk1.xlsx"))
wkb2 = load_workbook(filename = os.path.join(srcdir, "wbk2.xlsx"))
#Find the last row of the excel data
ws1 = wkb1.active
wkb1_LastRow = ws1.max_row
ws2 = wkb2.active
wkb2_LastRow = ws2.max_row
for xrow in range (1,(wkb1_LastRow+1)):
    for yrow in range (1,(wkb2_LastRow+1)):
        print (ws1.cell(row=xrow, column=1).value, ws2.cell(row=yrow, column=1).value )
        if ws1.cell(row=xrow, column=1).value == ws2.cell(row=yrow, column=1).value:
            print('HIT')
The thing is that the if statement always fails even though the 2 columns contain the same values:
...
3145728 3145728,
3145728 3145729,
3145728 3145730,
3145728 3145731,
...
Any ideas?

FWIW using nested loops is not the way to do this. It is much simpler to use zip.
The following should work:
for src, target in zip(ws1['A'], ws2['A']):
    if src.value == target.value:
        print("Hit")
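Note that zip compares the columns row by row, whereas the nested loops in the question compare every value with every other value. If the goal is "which values occur in both columns", a set intersection may fit better. The sketch below is only an illustration on plain lists; with openpyxl the lists could be built with something like [c.value for c in ws1['A']]. It also assumes, without confirmation from the question, that the comparison fails because one workbook stores the numbers as text; normalize and common_values are hypothetical helper names.

```python
def normalize(value):
    """Return a comparable string for a cell value, or None for empty cells."""
    if value is None:
        return None
    # Integral floats such as 3145728.0 are written without the ".0" suffix.
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).strip()


def common_values(column_a, column_b):
    """Return the normalized values that occur in both columns."""
    set_a = {normalize(v) for v in column_a} - {None}
    set_b = {normalize(v) for v in column_b} - {None}
    return set_a & set_b


# Stand-in data: numbers in one column, the same numbers stored as text in the other.
col1 = [3145728, 3145729, 3145730]
col2 = ["3145729", " 3145730", "3145731", None]
print(sorted(common_values(col1, col2)))
```

Printing type(cell.value) for a few cells in each workbook is a quick way to check whether the type-mismatch assumption actually applies.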

Related

Compare sheets across in two Excel files that have many sheets

I am looking to compare two Excel workbooks that should be identical to identify where there are differences between the content.
The code below, which I found here, works great, but I have workbooks with varying numbers of sheets (some have one sheet, others have 70 sheets across the two workbooks). Is there a way to iterate through all of the dataframes/sheets in the workbook (e.g. a range of indices) without having to hard code the index numbers? Thanks!
In block 1
sheet1 = rb1.sheet_by_index(0)
Then in block 2
sheet1 = rb1.sheet_by_index(1)
Then in block 3
sheet1 = rb1.sheet_by_index(2)
from itertools import izip_longest
import xlrd
rb1 = xlrd.open_workbook('file1.xlsx')
rb2 = xlrd.open_workbook('file2.xlsx')
sheet1 = rb1.sheet_by_index(0)
sheet2 = rb2.sheet_by_index(0)
for rownum in range(max(sheet1.nrows, sheet2.nrows)):
    if rownum < sheet1.nrows:
        row_rb1 = sheet1.row_values(rownum)
        row_rb2 = sheet2.row_values(rownum)
        for colnum, (c1, c2) in enumerate(izip_longest(row_rb1, row_rb2)):
            if c1 != c2:
                print "Row {} Col {} - {} != {}".format(rownum+1, colnum+1, c1, c2)
    else:
        print "Row {} missing".format(rownum+1)
You can use the openpyxl library to get the sheet names of an Excel workbook.
from openpyxl import load_workbook
import pandas as pd
excel = load_workbook(filepath, read_only=True, keep_links=False)
sheet_names = excel.sheetnames
print(sheet_names) # prints the sheet names of the workbook as a list
You can then iterate over the sheet_names variable:
for sheet in sheet_names:
    df = pd.read_excel(filepath, sheet_name=sheet, engine='openpyxl')
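Once the sheet names are known, the cell-by-cell comparison from the question can be repeated for every sheet without hard-coded indexes. The following is a rough sketch under stated assumptions: each workbook is represented as a plain dict mapping sheet name to a list of row tuples (with openpyxl, such a dict could be built from wb.sheetnames and ws.iter_rows(values_only=True)), and diff_workbooks is a hypothetical helper name, not part of any library.

```python
from itertools import zip_longest


def diff_workbooks(book_a, book_b):
    """Return a list of (sheet, row, col, detail) tuples; rows and cols are 1-based."""
    differences = []
    for name in sorted(set(book_a) | set(book_b)):
        rows_a = book_a.get(name)
        rows_b = book_b.get(name)
        if rows_a is None or rows_b is None:
            differences.append((name, None, None, "sheet missing in one workbook"))
            continue
        # Pad the shorter sheet with empty rows, and the shorter row with None.
        padded = zip_longest(rows_a, rows_b, fillvalue=())
        for r, (row_a, row_b) in enumerate(padded, start=1):
            for c, (a, b) in enumerate(zip_longest(row_a, row_b), start=1):
                if a != b:
                    differences.append((name, r, c, (a, b)))
    return differences


# Stand-in data for two workbooks with partly different sheets.
book_a = {"S1": [(1, 2), (3, 4)], "S2": [("x",)]}
book_b = {"S1": [(1, 2), (3, 5), (6,)], "S3": []}
for difference in diff_workbooks(book_a, book_b):
    print(difference)
```

Matching sheets by name rather than by index means a sheet that exists in only one workbook is reported instead of being silently compared with the wrong partner.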

Is there a way to export a list of 100+ dataframes to excel?

So this is kind of weird but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
So I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to Excel just overwrites the data in the file. I want to append to the end of the .xlsx file instead. The biggest catch to all of this is that I can only use Excel 2010; I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, and I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows
wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index = True, header = True):
        ws.append(r)
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am however committed to this monstrosity.
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will list the entirety of the program below so hopefully I can explain what's in my head. Also, feel free to roast my code because I have no idea what counts as good Python practice and what doesn't.
import os
import pandas as pd
from openpyxl import load_workbook
#the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
#the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
#the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'
#establishing my list that I will store looped data into
file_list = []
main_df = []
master_list = []
#open the file path to store the directory in files
files = os.listdir(in_path)
#database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index
#searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)
#read in the files to a dataframe, main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    #get columns headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    #adding to store where headers are stored in DF
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                #main compare, if str and matches search params, then do...
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name]) #store data headers
                        row_list.append(number) #store row number where it is in that data frame
                        column_list.append(name) #store column number where it is in that data frame
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index = range(2))], ignore_index = True)
    #turns the dataframe into a set of booleans where its true if
    #theres something there
    na_finder = df.notna()
    #create a new dataframe to write the output to
    outdf = pd.DataFrame(columns = header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            #I turn the dataframe into booleans and read until False
            if(df.at[row_list[i] + k, column_list[i]] not in header_list):
                #Store actual dataframe into my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into excel like:
So the comment from Ashish really helped me, all of the dataframes had different column titles so my 100+ dataframes eventually concat'd to a dataframe that is 569X52. Here is the code that I used, I completely abandoned openpyxl because once I was able to concat all of the dataframes together, I just had to export it using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if(df.at[row_list[i] + k, column_list[i]] not in header_list):
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)
to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
The output to excel ended up looking something like this:
Hopefully this can help someone else out too.

Getting staircase output from openpyxl when importing data

I'm trying to import data from multiple sheets to another in excel, and in order to do this I need python to input the data into the first empty cell, instead of overwriting the data from the last file. It seems to almost work, however, each column is jumping to its "own" empty row, and not staying in the correct row with the rest of its matching data, creating a staircase type pattern.
This is my code
import os
import openpyxl
os.chdir('C:\\Users\\XX\\Desktop')
wb1 = openpyxl.load_workbook('Test file python.xlsx', data_only = True) #open source excel file
ws1 = wb1.worksheets[0]
wb2 = openpyxl.load_workbook('test3.xlsx', data_only = True) #destination excel file
ws2 = wb2.active
#row_offset = ws2.max_row + 1
for i in range(10,150):
    for j in range(3,13):
        c = ws1.cell(row = i, column = j)
        rowOffset = ws2.max_row + 1
        rowNum = rowOffset
        ws2.cell(row = rowNum, column = j-2).value = c.value
wb2.save('test3.xlsx')
Here is a screenshot of the output in excel Staircase output
Every time you put something in ws2 (i.e. ws2.cell(row = rowNum, column = j-2).value = c.value) you change ws2.max_row. Because max_row is read again for every cell, it goes up by one on each pass through the inner loop, and that creates the staircase effect.
Set current_row = ws2.max_row outside of the nested (inner) loop, so that it is read once per source row rather than once per cell, and it should fix your "staircase" issue.
Also note that on the first iteration max_row == 1, which is why your data starts at row 2 and not at row 1.
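To make the row arithmetic concrete, here is a small sketch that uses a plain dict keyed by (row, column) as a stand-in for a worksheet, so it runs without any Excel files. max_row and append_block are hypothetical helpers written for this illustration; with openpyxl the same idea would apply with ws2.max_row read once before the loops.

```python
def max_row(sheet):
    """Highest used row of the stand-in sheet; 1 when it is empty."""
    return max((row for row, _col in sheet), default=1)


def append_block(src, dst, rows, cols, col_shift):
    """Copy src cells for the given rows/cols below the existing data in dst."""
    start = max_row(dst) + 1  # read once, before any cell is written
    for offset, i in enumerate(rows):
        for j in cols:
            # Every column of source row i lands in the same destination row.
            dst[(start + offset, j - col_shift)] = src.get((i, j))
    return start


# Hypothetical data: a 2x2 block at rows 10-11, columns 3-4 of the source.
src = {(10, 3): "a", (10, 4): "b", (11, 3): "c", (11, 4): "d"}
dst = {(1, 1): "header"}
append_block(src, dst, rows=range(10, 12), cols=range(3, 5), col_shift=2)
print(sorted(dst.items()))
```

Because the start row is fixed before writing and each source row only adds a constant offset, writing one column can no longer push the next column down a row.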

Openpyxl - Copy range of cells(with formula) from a workbook to another

I'm trying to copy specific rows from Workbook 1 and append it to the existing data in Workbook 2.
Copy the highlighted rows from Workbook 1 and append them in Workbook 2 below 'March'.
So far I have managed to copy and paste the range, but there are two problems:
1. Cells are shifted
2. The percentage (formula) is missing, leaving only numeric values.
See Result here
import openpyxl as xl
source = r"C:\Users\Desktop\Test_project_20200401.xlsx"
wbs = xl.load_workbook(source)
wbs_sheet = wbs["P2"] #selecting the sheet
destination = r"C:\Users\Desktop\Try999.xlsx"
wbd = xl.load_workbook(destination)
wbd_sheet = wbd["A3"] #select the sheet
row_data = 0
for row in wbs_sheet.iter_rows():
    for cell in row:
        if cell.value == "Yes":
            row_data += cell.row
for row in wbs_sheet.iter_rows(min_row=row_data, min_col = 1, max_col=250, max_row = row_data+1):
    wbd_sheet.append((cell.value for cell in row))
wbd.save(destination)
Does anyone have any idea on how can I solve this?
Any feedback/solution would help!
Thanks!
I think min_col should = 0
Range("A1").Formula (in VBA) gets the formula.
Range("A1").Value (in VBA) gets the value.
So try using .formula in Python
(thanks to: Get back a formula from a cell - VBA ... if this works)
Just want to add my own solution in here.
What I did was iterate through the columns and apply cell.number_format = '0%', which displays the cell value as a percentage.
for col in ws.iter_cols(min_row=1, min_col=2, max_row=250, max_col=250):
    for cell in col:
        cell.number_format = '0%'
More info can be found here:
https://openpyxl.readthedocs.io/en/stable/_modules/openpyxl/styles/numbers.html

Conditional parsing and output of xlsx files with Openpyxl

I'm working through data for a research project. Output is in the form of .csv files, which have been converted to .xlsx files. There is a separate output file for each participant, with each file containing data on about 40 different measurements across several dozen (or so) stimuli. To make any sense of the data collected, we need to look at each stimulus separately with its relevant associated measurements. Each output file is large (50 columns by 60000 rows). I’m looking to parse the data using openpyxl, searching a pre-specified column for cells with a particular string value. When such a cell is found, I want to write that cell to a new workbook along with other specified columns from the same row.
For instance, parsing the following table, I’m trying to use openpyxl to search column A for ‘Slide 2’. When this value is found for a particular row, that cell is written to a new workbook along with the values in column C and D for that same row.
    A        B      C      D
1   Slide    Data1  Data2  Data3
2   Slide 1  1      2      3
3   Slide 2  4      5      6
4   Slide 2  7      8      9
Would write:
    A        B      C      D
2   Slide 2  5      6
3
4
... or some similar format.
I would also look to fill column D and E with data from the next file, and F and G with data from the file after that (and so on), but I can probably figure that part out.
I’ve tried:
from openpyxl import load_workbook
wb = load_workbook(filename = r'test108.xlsx')
ws = wb.worksheets[0]
dest_filename = r'output.xlsx'
for x in range (0, 100): #0-100 as proof of concept before parsing entire worksheet
    if ws.cell(row = x, column =26) == 'some_image.jpg':
        print (ws.cell(row =x, column =26), ws.cell(row = x, column = 10), ws.cell(row = x, column = 17))
wb.save = dest_filename
also with adding the following in an attempt to create a worksheet in memory within which to manipulate cells:
for i in range (0, 30):
    for j in range (0, 100):
        print (ws.cell(row =i, column=j))
... both with minor variations, but they all output a copy of the original file.
I’ve read and re-read the documentation for openpyxl but to no avail. There doesn’t seem to be any similar question on the forums here either.
Any insight in correctly manipulating and writing data would be greatly appreciated. I also hope this might help other people trying to make sense of huge datasets. Thanks in advance!
I'm on Windows 7 running Python3.3.2 (64 bit) with openpyxl-1.6.2. Data was originally in .csv format, so could be exported to .xls or other formats if this helps. I looked into xlutils (using xlwt and xlrd) briefly, but openpyxl worked better with xlsx files.
Edit
Many thanks to @MikeMüller for pointing out I needed two workbooks to transfer data between. That makes much more sense.
I now have the following, but it still returns an empty workbook. The original cells are not blank. (The commented lines are for simplification - without the indent, of course - but code not successful either way.)
import openpyxl
wb = openpyxl.load_workbook(filename = r'test108.xlsx')
ws = wb.worksheets[0]
wb_out = openpyxl.Workbook()
ws_out = wb_out.worksheets[0]
#n = 1
#for x in range (0, 1000):
#if ws.cell(row = x, column = 27) == '7.image2.jpg':
ws_out.cell(row = n, column = 1) == ws.cell(row = x, column = 26) #x changed
ws_out.cell(row = n, column = 2) == ws.cell(row = x, column = 10) #x changed
ws_out.cell(row = n, column = 3) == ws.cell(row = x, column = 17) #x changed
#n += 1
wb_out.save('output108.xlsx')
Edit 2
I've updated the code to include the .value for cells, but it still returns a blank workbook.
import openpyxl
wb = openpyxl.load_workbook(filename = r'test108.xlsx')
ws = wb.worksheets[0]
wb_out = openpyxl.Workbook()
ws_out = wb_out.worksheets[0]
n = 1
for x in range (0, 1000):
    if ws.cell(row=x, column=27).value == '7.Image001.jpg':
        ws_out.cell(row=n, column=1).value = ws.cell(row=x, column=27).value
        ws_out.cell(row=n, column=2).value = ws.cell(row=x, column=10).value
        ws_out.cell(row=n, column=3).value = ws.cell(row=x, column=17).value
        n += 1
wb_out.save('output108.xlsx')
Summary for the next person with trouble:
You need to create two worksheets in memory: one to import your file, the other to write to a new workbook file.
Use the cell.value attribute to pull the text entered into each cell of your imported workbook, and assign it to the desired cells in the exported workbook.
Make sure you start counting rows and columns at zero (this applies to the openpyxl 1.x release used here; later openpyxl releases count rows and columns from 1).
You are doing cell assignment incorrectly. Here's what should work:
import openpyxl
wb = openpyxl.load_workbook(filename = r'test108.xlsx')
ws = wb.worksheets[0]
wb_out = openpyxl.Workbook()
ws_out = wb_out.worksheets[0]
n = 1
for x in range (0, 1000):
    if ws.cell(row=x, column=27).value == '7.image2.jpg':
        ws_out.cell(row=n, column=1).value = ws.cell(row=x, column=26).value #x changed
        ws_out.cell(row=n, column=2).value = ws.cell(row=x, column=10).value #x changed
        ws_out.cell(row=n, column=3).value = ws.cell(row=x, column=17).value #x changed
        n += 1
wb_out.save('output108.xlsx')
You need to open a second workbook for writing:
import openpyxl
wb_out = openpyxl.Workbook()
ws_out = wb_out.worksheets[0]
Put this in your loop:
ws_out.cell('cell indices here').value = desired_value
Save your file:
wb_out.save(dest_filename)
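The filtering step itself (keep selected columns of every row whose key column matches) can be sketched on plain tuples, independently of the openpyxl version. This is only an illustration: select_rows is a hypothetical helper, and the table is the example from the question. In recent openpyxl releases (not necessarily 1.6.2), the input rows could come from ws.iter_rows(values_only=True) and each result could be written with ws_out.append(row).

```python
def select_rows(rows, key_index, wanted, keep_indexes):
    """Return the kept columns of every row whose key column equals `wanted`."""
    selected = []
    for row in rows:
        if row[key_index] == wanted:
            selected.append(tuple(row[i] for i in keep_indexes))
    return selected


# The example table from the question, as plain tuples (columns A-D).
table = [
    ("Slide", "Data1", "Data2", "Data3"),
    ("Slide 1", 1, 2, 3),
    ("Slide 2", 4, 5, 6),
    ("Slide 2", 7, 8, 9),
]
# Keep columns A, C and D (indexes 0, 2, 3) where column A is "Slide 2".
matches = select_rows(table, key_index=0, wanted="Slide 2", keep_indexes=(0, 2, 3))
print(matches)
```

Working on whole rows avoids per-cell index arithmetic, which is where the zero-based versus one-based confusion tends to creep in.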
