I have to port an algorithm from an Excel sheet to Python code, but first I have to reverse engineer the algorithm from the Excel file.
The Excel sheet is quite complicated: it contains many cells with formulas that refer to other cells (which can in turn contain a formula or a constant).
My idea is to analyze the sheet with a Python script, building a sort of table of dependencies between cells, that is:
A1 depends on B4,C5,E7 formula: "=sqrt(B4)+C5*E7"
A2 depends on B5,C6 formula: "=sin(B5)*C6"
...
The xlrd Python module can read an XLS workbook, but at the moment I can only access the value of a cell, not its formula.
For example, with the following code I can get only the value of a cell:
import xlrd
# open the .xls file
xlsname = "test.xls"
book = xlrd.open_workbook(xlsname)
# build a dictionary mapping sheet names to sheets
sd = {}
for s in book.sheets():
    sd[s.name] = s
# obtain sheet "Foglio 1" from the sheet-name dictionary
sheet = sd["Foglio 1"]
# print the value of cell J141 (xlrd indices are 0-based: row 140, column 9)
print(sheet.cell(140, 9))
However, there seems to be no way to get the formula from the Cell object returned by the .cell(...) method.
The documentation says that it is possible to get a string version of the formula (in English, because no information about function name translation is stored in the Excel file). It talks about formulas (expressions) in the Name and Operand classes, but I cannot understand how to get instances of these classes from the Cell instance that must contain them.
Could you suggest a code snippet that gets the formula text from a cell?
[Dis]claimer: I'm the author/maintainer of xlrd.
The documentation references to formula text are about "name" formulas; read the section "Named references, constants, formulas, and macros" near the start of the docs. These formulas are associated sheet-wide or book-wide to a name; they are not associated with individual cells. Examples: PI maps to =22/7, SALES maps to =Mktng!$A$2:$Z$99. The name-formula decompiler was written to support inspection of the simpler and/or commonly found usages of defined names.
Formulas in general are of several kinds: cell, shared, and array (all associated with a cell, directly or indirectly), name, data validation, and conditional formatting.
Decompiling general formulas from bytecode to text is a "work-in-progress", slowly. Note that supposing it were available, you would then need to parse the text formula to extract the cell references. Parsing Excel formulas correctly is not an easy job; as with HTML, using regexes looks easy but doesn't work. It would be better to extract the references directly from the formula bytecode.
Also note that cell-based formulas can refer to names, and name formulas can refer both to cells and to other names. So it would be necessary to extract both cell and name references from both cell-based and name formulas. It may be useful to you to have info on shared formulas available; otherwise having parsed the following:
B2 =A2
B3 =A3+B2
B4 =A4+B3
B5 =A5+B4
...
B60 =A60+B59
you would need to deduce the similarity between the B3:B60 formulas yourself.
In any case, none of the above is likely to be available any time soon -- xlrd priorities lie elsewhere.
Update: I have gone and implemented a little library to do exactly what you describe: extracting the cells & dependencies from an Excel spreadsheet and converting them to python code. Code is on github, patches welcome :)
Just to add that you can always interact with excel using win32com (not very fast but it works). This does allow you to get the formula. A tutorial can be found here [cached copy] and details can be found in this chapter [cached copy].
Essentially you just do:
app.ActiveWorkbook.ActiveSheet.Cells(r,c).Formula
As for building a table of cell dependencies, a tricky thing is parsing the excel expressions. If I remember correctly the Trace code you mentioned does not always do this correctly. The best I have seen is the algorithm by E. W. Bachtal, of which a python implementation is available which works well.
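For simple formulas only, the dependency table from the question can be sketched with a naive regular expression. As noted elsewhere in this thread, regexes break on real-world formulas (quoted strings, sheet names that look like cells, defined names, ranges), so treat this as an illustration rather than a parser. The names `CELL_REF`, `find_references` and `build_dependency_table` are made up for this sketch, and the sheet qualifier of a reference such as Data!B5 is simply ignored.

```python
import re

# Naive A1-style reference: optional "$" anchors, 1-3 column letters, a row number.
# The trailing lookahead rejects function names such as LOG10( .
CELL_REF = re.compile(r"(?<![A-Za-z0-9_])\$?([A-Z]{1,3})\$?([0-9]+)(?![A-Za-z0-9_(])")

def find_references(formula):
    """Return the cell references in a formula, in order, without duplicates."""
    refs = []
    for match in CELL_REF.finditer(formula):
        ref = match.group(1) + match.group(2)
        if ref not in refs:
            refs.append(ref)
    return refs

def build_dependency_table(formulas):
    """Map each cell to the cells its formula mentions."""
    return {cell: find_references(text) for cell, text in formulas.items()}

# Formulas taken from the question; in practice they would be read from Excel.
formulas = {"A1": "=sqrt(B4)+C5*E7", "A2": "=sin(B5)*C6"}
table = build_dependency_table(formulas)
for cell in sorted(table):
    print("%s depends on %s" % (cell, ",".join(table[cell])))
```

The input dictionary could be filled from the .Formula property shown above; a proper tokenizer is still the safer choice for anything beyond plain cell references.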
So I know this is a very old post, but I found a decent way of getting the formulas from all the sheets in a workbook as well as having the newly created workbook retain all the formatting.
First step is to save a copy of your .xlsx file as .xls
-- Use the .xls as the filename in the code below
Using Python 2.7
from lxml import etree
from StringIO import StringIO
import xlsxwriter
import subprocess
from xlrd import open_workbook
from xlutils.copy import copy
from xlsxwriter.utility import xl_cell_to_rowcol
import os

file_name = '<YOUR-FILE-HERE>'
dir_path = os.path.dirname(os.path.realpath(file_name))
# unzip the original .xlsx (same name plus "x") to reach the per-sheet XML
subprocess.call(["unzip", str(file_name + "x"), "-d", "file_xml"])

xml_sheet_names = dict()
with open_workbook(file_name, formatting_info=True) as rb:
    wb = copy(rb)
    workbook_names_list = rb.sheet_names()
    for i, name in enumerate(workbook_names_list):
        xml_sheet_names[name] = "sheet" + str(i + 1)

sheet_formulas = dict()
for i, k in enumerate(workbook_names_list):
    xmlFile = os.path.join(dir_path, "file_xml/xl/worksheets/{}.xml".format(xml_sheet_names[k]))
    with open(xmlFile) as f:
        xml = f.read()
    tree = etree.parse(StringIO(xml))
    context = etree.iterparse(StringIO(xml))
    sheet_formulas[k] = dict()
    for _, elem in context:
        # <f> elements hold the formula; their parent <c> carries the cell reference
        if elem.tag.split("}")[1] == 'f':
            cell_key = elem.getparent().get(key="r")
            cell_formula = elem.text
            sheet_formulas[k][cell_key] = str("=" + cell_formula)

sheet_formulas
Structure of Dictionary 'sheet_formulas'
{'Worksheet_Name': {'A1_cell_reference':'cell_formula'}}
Example results:
{u'CY16': {'A1': '=Data!B5',
'B1': '=Data!B1',
'B10': '=IFERROR(Data!B12,"")',
'B11': '=IFERROR(SUM(B9:B10),"")',
It seems that it is impossible now to do what you want with xlrd. You can have a look at this post for the detailed description of why it is so difficult to implement the functionality you need.
Note that the development team does a great job of providing support on the python-excel Google group.
I know this post is a little late but there's one suggestion that hasn't been covered here. Cut all the entries from the worksheet and paste using paste special (OpenOffice). This will convert the formulas to numbers so there's no need for additional programming and this is a reasonable solution for small workbooks.
Yes! It works for me with win32com.
import win32com.client
Excel = win32com.client.Dispatch("Excel.Application")
# python -m pip install pywin32
file=r'path Excel file'
wb = Excel.Workbooks.Open(file)
sheet = wb.ActiveSheet
#Get value
val = sheet.Cells(1,1).value
# Get Formula
sheet.Cells(6,2).Formula
Forgive me if this is an idiotic question. I'm new to coding and wanted to automate part of my workflow.
I'm enjoying the puzzle, so I won't ask too many questions, but I'm stuck on this.
Every time an order comes in, I have to copy data from raw Excel files into a template.
I want to replace the three headers at the top of this page with variables I've already extracted from the raw Excel data:
[image]
so that it would look like this on every page:
[image]
In every tutorial I see, the "header" is just row 1.
I think xlsxwriter can change headers like these, but it looks like it can do so only on new worksheets.
df1.to_clipboard(index=False, header=False) #Copies df1 to clipboard (BOM Data)
ws.Range("A2").Select()
ws.PasteSpecial(Format='Unicode Text') # Paste as text in template
# So at this point I guess I'm using pywin32 to copy and paste, but have to switch back to xlsxwriter to change the header?
wb = xlsxwriter.Workbook(r'C:\Users\jfras\Desktop\Auto BOM\PARKER BOM TEMPLATE.xlsx')
ws = wb.Worksheets(1)
header1 = '&CTest Entry'
Your question is a little unclear: the screenshots you attached look to be from inside Word. It seems like you are trying to automate moving data from Excel into a Word document template, is that correct?
If I understand correctly, you will need one Python package to read your Excel document and another to insert that data into a parameterized template in Word. Here is an article explaining exactly that.
In a nutshell, using openpyxl (or presumably any Python Excel reader of your choosing) you would read the Excel sheet, then plug your data into a Word template using something like python-docx. The article linked above contains code snippets explaining this process in more detail.
I hope I understood your question correctly. If so, something like the code below may work:
import xlsxwriter
workbook = xlsxwriter.Workbook('teste.xlsx')
worksheet = workbook.add_worksheet()
worksheet.set_header('&L P10853' + '&CTEST OBJECT' + '&RUN_28583')
workbook.close()
Of course, if you just run this code you will end up with an empty sheet that prints nothing until you fill at least one cell.
Anyway, the key call here is set_header, which does what we want. In the string, &L introduces the left header, &C the center header, and &R the right header. You can see more at https://xlsxwriter.readthedocs.io/example_headers_footers.html
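Since the question is about filling the header from variables, here is a minimal sketch of building the header string from three values. `build_header` and `write_workbook_with_header` are hypothetical helper names, not part of xlsxwriter. Keep in mind that xlsxwriter only creates new files, so this would not edit an existing template in place.

```python
def build_header(left, center, right):
    """Join three values into one header string using the &L/&C/&R codes."""
    # A literal "&" has to be doubled, because "&" starts a control code.
    parts = [str(p).replace("&", "&&") for p in (left, center, right)]
    return "&L{}&C{}&R{}".format(*parts)

def write_workbook_with_header(path, header):
    """Create a new workbook whose first sheet carries the given header."""
    import xlsxwriter  # imported here so build_header works without it
    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    worksheet.write(0, 0, "placeholder")  # a sheet needs content to print
    worksheet.set_header(header)
    workbook.close()

# The three values would come from the data already extracted from the raw file.
header = build_header("P10853", "TEST OBJECT", "UN_28583")
print(header)
```

Calling `write_workbook_with_header('teste.xlsx', header)` would then produce the same kind of file as the snippet above, with the three values coming from variables.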
I am new to the Python world and I am struggling to build a solution. The goal is to check that some mandatory information (keywords) is present in a PDF. I have an Excel file where each row corresponds to a transaction, and I need to check that every transaction (and the mandatory information related to it) appears in a corresponding PDF sent during the day.
So, on one side, I have several Excel rows in a sheet with the mandatory information (the details of each transaction), and on the other side, I have a folder with several PDFs.
I am trying to extract the data of each PDF so that the workflow can check whether the information for each row in my Excel file is in a single PDF. I looked at some questions raised here and tried to apply their solutions to my problem, but I haven't managed to obtain a fully working solution.
I have been able to build the partial code that extracts the PDF data and looks for the keywords:
import os
from glob import glob
import re
from PyPDF2 import PdfFileReader

def search_page(pattern, page):
    # yield every keyword match found in the text of one page
    yield from pattern.findall(page.extractText())

def search_document(pattern, path):
    document = PdfFileReader(path)
    for page in document.pages:
        yield from search_page(pattern, page)

searchWords = ['my list of keywords in each row of my Excel file']
pattern = re.compile(r'\b(?:%s)\b' % '|'.join(searchWords))
for path in glob('path of my folder with all the pdf files'):
    matches = search_document(pattern, path)
# inspired by a solution on Stack Overflow used to count the occurrences of keywords
Also, I think that using pandas to build the list of keywords should work, but I can't use it in my previous code: the search tool wants a string, not a list.
import pandas as pd
df=pd.read_excel('path of my Excel file', sheet_name=0, usecols='G,L,R,S,Z')
print(df)  # I wanted to check that the code selects only the right columns, as some other columns hold unnecessary information
I don't know how to build a searchWords list for each row of my Excel file and feed it into the first part of the code. Also, I don't know how to require that ALL the keywords of the list (one row in Excel) are found, as it is mandatory to have all the information of a transaction in the same PDF. When it finds all the info, it should return "ok row 1" or something like that, then do the check for the second row, and so on (and report an error if it doesn't find all the information).
P.S.: Originally, I only wanted to extract the data with Python code and add it to an Alteryx workflow, but the Python tool of Alteryx doesn't accept some packages in my company.
I would be very thankful for any help!
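The per-row check described above could be sketched roughly as follows. This is only an outline under stated assumptions: each row has already been turned into a list of keyword strings (for example with `df.values.tolist()`), the text of every PDF has already been extracted into a dict, and a plain case-insensitive substring match is good enough. The names `missing_keywords` and `check_rows`, and the sample data, are made up for the illustration.

```python
def missing_keywords(keywords, text):
    """Return the keywords that do not occur in text (case-insensitive)."""
    haystack = text.lower()
    return [kw for kw in keywords if str(kw).lower() not in haystack]

def check_rows(rows, pdf_texts):
    """For each row (a list of keywords), report which single PDFs contain them all.

    rows      -- list of keyword lists, one per Excel row
    pdf_texts -- dict mapping a PDF name to its full extracted text
    """
    report = []
    for number, keywords in enumerate(rows, start=1):
        hits = [name for name, text in pdf_texts.items()
                if not missing_keywords(keywords, text)]
        status = "ok" if hits else "error"
        report.append((number, status, hits))
    return report

# Hypothetical data standing in for the Excel rows and the extracted PDF text.
rows = [["ACME", "INV-001", "100.00"], ["Globex", "INV-002", "250.00"]]
pdf_texts = {"a.pdf": "Invoice INV-001 for ACME, total 100.00",
             "b.pdf": "Invoice INV-002 for Initech, total 250.00"}
for number, status, hits in check_rows(rows, pdf_texts):
    print("%s row %d %s" % (status, number, hits))
```

Because every keyword of a row is tested against one PDF at a time, a row only counts as "ok" when a single file contains all of its values.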
I'm trying to efficiently delete rows in Excel spreadsheets that meet certain criteria. I think it would probably be faster if I could use some built-in VBA functionality to select all the rows that I want to delete, and then delete them all at once.
The following link is the VBA equivalent of something I'd like to do using the python win32com.client module:
https://stackoverflow.com/a/31390697/9453718
I am currently just looping through and deleting rows as I go. I go through all the rows in excel, and if it meets some criteria I call: Sheet.Rows(r).EntireRow.Delete()
Any suggestions?
Edit: here's a sample of what I currently have that deletes each row one by one
import win32com.client as win32
excel = win32.gencache.EnsureDispatch('Excel.Application')
book = excel.Workbooks.Open("ExcelWith200000Lines.xlsm")
sheet = book.Worksheets(1)  # gets first sheet (COM collections are 1-based)
nrows = sheet.UsedRange.Row + sheet.UsedRange.Rows.Count - 1  # number of rows
ncols = sheet.UsedRange.Column + sheet.UsedRange.Columns.Count - 1  # number of cols
# NumToLetter is a helper (not shown) that converts a column number to its letter
data = sheet.Range('A1:' + NumToLetter(ncols) + str(nrows)).Value  # loads all the data on the sheet into the python script
RELEVANT_COL_INDEX = 0
for r in range(nrows - 1, -1, -1):  # starts at last row and goes to first row
    if data[r][RELEVANT_COL_INDEX] == "delete_me":
        sheet.Rows(r + 1).EntireRow.Delete()  # deletes the row here
book.SaveAs("myOutput.xlsm")
The SELECT CASE statement is not part of Excel's object library; it is part of the VBA language. When you connect Python to Excel as a COM interface using win32com, you are connecting only to Excel and its object library: its objects (workbooks, worksheets, charts, etc.) and their properties and methods.
In fact, VBA does exactly what you are doing in Python: it is a COM interface to the Excel object library. Look under Tools \ References and you will find that VBA is usually the first selected, referenced library. Hence it is not part of Excel. The Select Case docs do not indicate anywhere that it is an Excel method. Thus, you can use SELECT CASE in MS Access VBA, Word VBA, Outlook VBA, etc.
Therefore, use Python's analogue of VBA's SELECT CASE, which would likely be multi-line if and elif statements. While other languages like Java, PHP, and R have a switch statement, Python does not. See Replacements for switch statement in Python.
Consider the example below, using your linked Excel question:
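value = xlwsh.Cells(i, 2).Value  # read the cell's value, not the Range object
if not (value in [0.4, 0.045, 0.05, 0.056, 0.063, 0.071, 0.08, 0.09]
        or value < 0):
    xlwsh.Rows(i).EntireRow.Delete()  # the parentheses are needed to call the method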
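On the "delete them all at once" part of the question, one possible approach is to collect the row numbers first and hand Excel a multi-area address such as "9:9,3:5" so that each COM call removes many rows. The sketch below assumes that Range accepts comma-separated row addresses and that an address string should stay under roughly 255 characters; both are assumptions worth checking against your Excel version. `group_rows`, `row_addresses` and `delete_rows` are hypothetical helper names.

```python
def group_rows(rows):
    """Collapse 1-based row numbers into sorted (first, last) runs."""
    runs = []
    for row in sorted(set(rows)):
        if runs and row == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], row)  # extend the current run
        else:
            runs.append((row, row))        # start a new run
    return runs

def row_addresses(rows, max_len=255):
    """Yield address strings such as "9:9,3:5", each at most max_len characters.

    Runs are emitted bottom-up, so deleting one chunk does not shift the
    rows named in the chunks that follow.
    """
    chunk = ""
    for first, last in reversed(group_rows(rows)):
        part = "%d:%d" % (first, last)
        candidate = part if not chunk else chunk + "," + part
        if len(candidate) > max_len and chunk:
            yield chunk
            chunk = part
        else:
            chunk = candidate
    if chunk:
        yield chunk

def delete_rows(sheet, rows):
    """Delete the given rows with one COM call per address chunk."""
    for address in row_addresses(rows):
        sheet.Range(address).Delete()
```

With the loop from the question, you would append `r + 1` to a list instead of deleting immediately, then call `delete_rows(sheet, that_list)` once at the end.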
I am trying to use win32com to copy a worksheet from my workbook to a new workbook. The code is working fine but the cell formulas in the new book point back to the original book. I would like to break the links in the new book so that these formulas are replaced with raw numbers. This is trivial to do in Excel but I haven't been able to find out how to do it using the win32com client in Python.
Here is a snippet of my code:
import win32com.client
xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
xl.Visible = True
#Open & Refresh Spreadsheet
wb = xl.Workbooks.Open(r"C:\Users\me\dummy.xlsx") #Dummy path
print("Refreshing data...")
wb.RefreshAll()
#Create new book and copy target sheet over
print("Opening new workbook")
nwb = xl.Workbooks.Add()
newfile = r"C:\Users\me\dummy2.xlsx"
wb.Worksheets(["Target Sheet"]).Copy(Before=nwb.Worksheets(1))
nwb.SaveAs(newfile)
This code works fine but in the saved "dummy2" file each of the cells containing formulas reference the original sheet. How can I break the links in the new book and/or copy values only from the original book?
Edit in response to @martineau's downvote of the answer and of the (admittedly unsatisfactory) Microsoft documentation.
I think you haven't been able to find out how to do this because you have been looking in the wrong place. Your question really has little to do with Python or with win32com.
This line
xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
fires up a COM client called xl that talks to excel.exe. Your variable xl is a thin Python wrapper around a Microsoft COM object that can call Excel VBA functions. When you type xl., everything after the dot is expected to be a VBA object or method. Any value (other than strings and floats) that you get back from a call is a VBA object in a thin Python wrapper. Python conventions do not necessarily apply to such objects.
So to find out about what functions you need to call, you need to be looking at the Excel VBA documentation. One difficulty with that documentation is that it assumes you are writing VBA, not Python. The other is that it isn't all that well-written.
The VBA method you need is Workbook.BreakLink().
Call it after copying the original workbook and before saving the copy, like this (I'm using your dummy filename here, don't expect it to actually work without fixing that):
wb.Worksheets(["Target Sheet"]).Copy(Before=nwb.Worksheets(1))
nwb.BreakLink(Name=r"C:\Users\me\dummy.xlsx", Type=1)
nwb.SaveAs(newfile)
The name of the link is the filename it points to, and the type of the link is 1 (for a link to an Excel spreadsheet). In this case you know the name of the link source (since you just made a copy of it) so there is no need to ask what the filename is, but in the general case you need to call Workbook.LinkSources() to find out what they are, and break them one by one.
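The general case mentioned above (ask the workbook for its link sources, then break them one by one) might look like the sketch below. `break_all_links` is a hypothetical helper; it assumes that LinkSources comes back as None when the workbook has no links, which is how pywin32 normally maps the Empty value, and that the constant 1 is valid for both calls as in the snippet above.

```python
XL_EXCEL_LINKS = 1  # links to Excel workbooks, same value as used above

def break_all_links(workbook, link_type=XL_EXCEL_LINKS):
    """Break every link of the given type and return the names that were broken."""
    sources = workbook.LinkSources(link_type)
    if not sources:  # assumption: no links means None (Empty) is returned
        return []
    broken = []
    for name in sources:
        # Each source name is the path of the workbook the formulas point to.
        workbook.BreakLink(Name=name, Type=link_type)
        broken.append(name)
    return broken
```

It would be called as `break_all_links(nwb)` after the Copy and before `nwb.SaveAs(newfile)`.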
My issue is below, but I would be interested in comments from anyone with experience with xlrd.
I just found xlrd and it looks like the perfect solution, but I'm having a little problem getting started. I am attempting to extract data programmatically from an Excel file I pulled from Dow Jones with the current components of the Dow Jones Industrial Average (link: http://www.djindexes.com/mdsidx/?event=showAverages)
When I open the file unmodified I get a nasty BIFF error (binary format not recognized).
However, you can see in this screenshot that Excel 2008 for Mac thinks it is in 'Excel 1997-2004' format (screenshot: http://skitch.com/alok/ssa3/componentreport-dji.xls-properties)
If I instead open it in Excel manually and save it explicitly in 'Excel 1997-2004' format, then open it in Python using xlrd, everything is wonderful. Remember, Office thinks the file is already in 'Excel 1997-2004' format. All files are .xls
Here is a pastebin of an ipython session replicating the issue: http://pastie.textmate.org/private/jbawdtrvlrruh88mzueqdq
Any thoughts on:
How to trick xlrd into recognizing the file so I can extract data?
How to use python to automate the explicit 'save as' format to one that xlrd will accept?
Plan B?
FWIW, I'm the author of xlrd, and the maintainer of xlwt (a fork of pyExcelerator). A few points:
The file ComponentReport-DJI.xls is misnamed; it is not an XLS file, it is a tab-separated-values file. Open it with a text editor (e.g. Notepad) and you'll see what I mean. You can also look at the not-very-raw raw bytes with Python:
>>> open('ComponentReport-DJI.xls', 'rb').read(200)
'COMPANY NAME\tPRIMARY EXCHANGE\tTICKER\tSTYLE\tICB SUBSECTOR\tMARKET CAP RANGE\
tWEIGHT PCT\tUSD CLOSE\t\r\n3M Co.\tNew York SE\tMMM\tN/A\tDiversified Industria
ls\tBroad\t5.15676229508\t50.33\t\r\nAlcoa Inc.\tNew York SE\tA'
You can read this file using Python's csv module ... just use delimiter="\t" in your call to csv.reader().
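As a minimal sketch of that suggestion in Python 3, the snippet below reads tab-separated text with csv.reader. To keep it runnable without the download, a few fields copied from the byte dump above stand in for the real file; `read_tsv` is just an illustrative name.

```python
import csv
import io

def read_tsv(fileobj):
    """Return all rows of a tab-separated file as lists of strings."""
    return list(csv.reader(fileobj, delimiter="\t"))

# A few fields from the dump above stand in for the real file.
sample = io.StringIO(
    "COMPANY NAME\tPRIMARY EXCHANGE\tTICKER\r\n"
    "3M Co.\tNew York SE\tMMM\r\n",
    newline="",
)
rows = read_tsv(sample)
header, first = rows[0], rows[1]
print(dict(zip(header, first)))
```

For the actual download you would pass `open('ComponentReport-DJI.xls', newline='')` instead of the StringIO object. Each line of the real file ends with a tab, so expect an empty trailing field in every row.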
xlrd can read any file that pyExcelerator can, and read them better—dates don't come out as floats, and the full story on Excel dates is in the xlrd documentation.
pyExcelerator is abandonware—xlrd and xlwt are alive and well. Check out http://groups.google.com/group/python-excel
HTH
John
xlrd support for Office 2007/2008 (OpenXML) format is in alpha test - see the following post in the python-excel newsgroup:
http://groups.google.com/group/python-excel/msg/0c5f15ad122bf24b?hl=en
More info on pyExcelerator: To read a file, do this:
import pyExcelerator
book = pyExcelerator.parse_xls(filename)
where filename is a string that is the filename to read (not a file-like object). This will give you a data structure representing the workbook: a list of pairs, where the first element of the pair is the worksheet name and the second element is the worksheet data.
The worksheet data is a dictionary, where the keys are (row, col) pairs (starting with 0) and the values are the cell contents -- generally int, float, or string. So, for instance, in the simple case of all the data being on the first worksheet:
data = book[0][1]
print 'Cell A1 of worksheet %s is: %s' % (book[0][0], repr(data[(0, 0)]))
If the cell is empty, you'll get a KeyError. If you're dealing with dates, they may (I forget) come through as integers or floats; if this is the case, you'll need to convert. Basically the rule is: datetime.datetime(1899, 12, 31) + datetime.timedelta(days=n) but that might be off by 1 or 2 (because Excel treats 1900 as a leap-year for compatibility with Lotus, and because I can't remember if 1900-1-1 is 0 or 1), so do some trial-and-error to check. Datetimes are stored as floats, I think (days and fractions of a day).
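To pin down the off-by-one uncertainty, here is a small sketch of the conversion. It assumes the workbook uses the 1900 date system (files saved by older Mac versions of Excel may use the 1904 system instead, which this does not handle). Serial 1 is 1900-01-01, serial 60 is the non-existent 1900-02-29, and every serial after that is one day "too large", which is why the epoch shifts. `excel_serial_to_datetime` is a made-up name.

```python
import datetime

def excel_serial_to_datetime(serial):
    """Convert a 1900-date-system serial number (days, may be fractional)."""
    if serial < 1:
        raise ValueError("serial numbers below 1 are times, not dates")
    if int(serial) == 60:
        raise ValueError("serial 60 is Excel's fictitious 1900-02-29")
    # Serials after the phantom leap day need the epoch moved back one day.
    epoch = datetime.datetime(1899, 12, 30 if serial > 60 else 31)
    return epoch + datetime.timedelta(days=serial)

print(excel_serial_to_datetime(39448.5))
```

The fractional part carries the time of day, so a serial ending in .5 lands on noon. xlrd ships its own date helpers, which are the better choice when xlrd is already in use.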
I think there is partial support for formulas, but I wouldn't guarantee anything.
Well here is some code that I did: (look down the bottom): here
Not sure about the newer formats - if xlrd can't read it, xlrd needs to have a new version released !
Do you have to use xlrd? I just downloaded 'UPDATED - Dow Jones Industrial Average Movers - 2008' from that website and had no trouble reading it with pyExcelerator.
import pyExcelerator
book = pyExcelerator.parse_xls('DJIAMovers.xls')