I am regularly receiving a spreadsheet from an external source (via google docs) that I have to convert into a local (kinda proprietary) format. To do that, I have written a script that can convert the spreadsheet as an ODS file into the needed format (non-ODS).
This script needs to interact with a lot of higher-level business-specific PHP stuff, so I use PhpSpreadsheet for this purpose (https://github.com/PHPOffice/PhpSpreadsheet/).
This PHP library does theoretically everything I need, but it cannot deal with overly complex spreadsheets without taking a gigantic amount of time on all the cross-referencing formulas. To speed up the processing in the script, I prepare the ODS file by hand by converting all formulas to values (select all cells in the needed sheets, then trigger [Data] > [Calculate] > [Formula to Value]). Then I delete all the unneeded sheets (which otherwise only contain source data for the replaced formulas). The resulting file is a lot smaller and does not contain any formulas. With the simplified file the PHP script finishes within a few seconds, while with the original file it runs out of memory after a long while.
I now want to automate this process of converting all the formulas to values using a new Python script. (This needs to happen on a Linux server, so my best bet would be a headless LibreOffice controlled via a UNO socket from Python, correct?)
So far I have managed to connect to the LibreOffice UNO socket and manipulate the cells via the old OpenOffice API (https://www.openoffice.org/api/docs/common/ref/com/sun/star/sheet/module-ix.html).
My current big question is:
How do I access the UI-Formula to Value-functionality on all cells of a sheet at once via the UNO API in Python?
I have tried searching the old OpenOffice API documentation for this for a while, but so far I cannot find what I am looking for.
Currently the python script looks (in essence) like this:
import uno
localContext = uno.getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext(
    "com.sun.star.bridge.UnoUrlResolver",
    localContext
)
context = resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
serviceManager = context.ServiceManager
desktop = serviceManager.createInstanceWithContext("com.sun.star.frame.Desktop", context)
# com.sun.star.lang.XComponent / com.sun.star.sheet.SpreadsheetDocument
document = desktop.getCurrentComponent()
# com.sun.star.sheet.XSpreadsheets / XNameAccess
sheets = document.getSheets()
# com.sun.star.sheet.XSpreadsheet
# https://www.openoffice.org/api/docs/common/ref/com/sun/star/sheet/XSpreadsheet.html
sheet = sheets.getByName('OneOfTheSheets')
#print(sheet.getCellRangeByName("A1:AP1000"))
# WAY TOO SLOW AND DESTRUCTIVE:
for row in range(0, 1000):
    for column in range(0, 42):
        cell = sheet.getCellByPosition(column, row)
        cell.setFormula(cell.getString())
Thank you for any help you can provide.
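One approach that may be worth trying, sketched here under assumptions rather than tested against this particular file: cell ranges in the UNO API expose XCellRangeData, whose getDataArray() returns the computed results of the cells (numbers and strings, not formula text), and whose setDataArray() writes plain constants back. Reading and rewriting the whole used area in one call should therefore replace the formulas with their values without a per-cell loop. The helper name formulas_to_values is made up for this sketch, and sheet is assumed to be the object returned by sheets.getByName(...) in the script above. How cells containing formula errors come back from getDataArray() is something to check on real data.

```python
def formulas_to_values(sheet):
    """Overwrite the used area of `sheet` with its computed values.

    `sheet` is assumed to be a com.sun.star.sheet.XSpreadsheet, e.g. the
    object returned by sheets.getByName(...) in the script above.
    Returns the number of rows that were rewritten.
    """
    cursor = sheet.createCursor()
    cursor.gotoStartOfUsedArea(False)  # jump to the first used cell
    cursor.gotoEndOfUsedArea(True)     # expand the range to the last used cell
    data = cursor.getDataArray()       # computed results, not formula strings
    cursor.setDataArray(data)          # written back as constant values
    return len(data)
```

To mirror the manual workflow, the conversion would have to run on every needed sheet first; only afterwards can the source-data sheets be dropped with sheets.removeByName(...) and the document saved with document.store() or storeToURL(...).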
Related
I am developing an application in Python that uses MS Excel as its frontend. I need to read Excel cells continuously to check whether the user has updated a value. After an update, some logic has to be called to pull data from the DB and put it into the next cell.
Input as my search field and output as a table
I am using the following logic, but it consumes a lot of resources and makes Excel hard to use when moving the cursor from one cell to another.
sht = self.wb.sheets[0]
while True:
    search_id = sht.range('C10').value
    search_result = search_for_id(search_id)
    sht.range("C12").value = search_result
    time.sleep(1)
Excel gets stuck every time I use it. I can increase the delay, but that will hurt performance. I need this to work live; even a single second of delay is costly.
Note: I am working on a fintech project. For ease of understanding I have made my scenario very simple.
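A minimal sketch of one way to reduce the load, assuming the expensive part is the database lookup and the write back to Excel rather than the read: remember the last value seen in C10 and only run the lookup and update C12 when it changes. The names poll_once, watch, read_cell, write_cell and lookup are hypothetical; in the loop above, read_cell would wrap sht.range('C10').value and lookup would be search_for_id.

```python
import time


def poll_once(read_cell, write_cell, lookup, last_seen):
    """Run one polling step and return the value now considered 'seen'.

    The lookup and the write happen only when the input cell changed.
    """
    current = read_cell()
    if current != last_seen:
        write_cell(lookup(current))
    return current


def watch(read_cell, write_cell, lookup, interval=0.2, max_steps=None):
    """Poll until max_steps is reached (forever if max_steps is None)."""
    last_seen = object()  # sentinel: forces a lookup on the first step
    steps = 0
    while max_steps is None or steps < max_steps:
        last_seen = poll_once(read_cell, write_cell, lookup, last_seen)
        steps += 1
        time.sleep(interval)
```

Because unchanged values cost only one cell read per step, the interval can probably be shortened without Excel being flooded with writes. Whether that is responsive enough for a live setting is something to measure.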
Is there a way to update a spreadsheet in real time while it is open in Excel? I have a workbook called Example.xlsx which is open in Excel and I have the following python code which tries to update cell B1 with the string 'ID':
import openpyxl
wb = openpyxl.load_workbook('Example.xlsx')
sheet = wb['Sheet']
sheet['B1'] = 'ID'
wb.save('Example.xlsx')
On running the script I get this error:
PermissionError: [Errno 13] Permission denied: 'Example.xlsx'
I know it's because the file is currently open in Excel, but I was wondering if there is another way or module I can use to update a sheet while it's open.
I have actually figured this out, and it's quite simple using xlwings. The following code opens an existing Excel file called Example.xlsx and updates it in real time; in this case it puts the value 45 in cell B2 as soon as you run the script.
import xlwings as xw
wb = xw.Book('Example.xlsx')
sht1 = wb.sheets['Sheet']
sht1.range('B2').value = 45
You've already worked out why you can't use openpyxl to write to the .xlsx file: it's locked while Excel has it open. You can't write to it directly, but you can use win32com to communicate with the copy of Excel that is running via its COM interface.
You can download win32com from https://github.com/mhammond/pywin32 .
Use it like this:
from win32com.client import Dispatch
xlApp = Dispatch("Excel.Application")
wb=xlApp.Workbooks.Item("MyExcelFile.xlsx")
ws=wb.Sheets("MyWorksheetName")
At this point, ws is a reference to a worksheet object that you can change. The objects you get back aren't Python objects but a thin Python wrapper around VBA objects that obey their own conventions, not Python's.
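As a rough illustration of what changing the worksheet looks like, here is a small helper. The function name write_cell is made up for this sketch; Range and Value are the Excel object model's own names, reached through the COM wrapper, and ws is assumed to be the worksheet reference obtained above.

```python
def write_cell(ws, address, value):
    """Set one cell on a COM worksheet object and return what it now holds."""
    cell = ws.Range(address)  # Range and Value are Excel object-model names
    cell.Value = value
    return cell.Value
```

With the worksheet from the snippet above, write_cell(ws, "B1", "ID") should make the change show up in the open workbook straight away.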
There is some useful if rather old Python-oriented documentation here: http://timgolden.me.uk/pywin32-docs/contents.html
There is full documentation for the object model here: https://msdn.microsoft.com/en-us/library/wss56bz7.aspx but bear in mind that it is addressed to VBA programmers.
If you want to stream real-time data into Excel from Python, you can use an RTD function. If you've ever used the Bloomberg add-in for accessing real-time market data in Excel, then you'll be familiar with RTD functions.
The easiest way to write an RTD function for Excel in Python is to use PyXLL. You can read how to do it in the docs here: https://www.pyxll.com/docs/userguide/rtd.html
There's also a blog post showing how to stream live tweets into Excel using Python here: https://www.pyxll.com/blog/a-real-time-twitter-feed-in-excel/
If you wanted to write an RTD server to run outside of Excel you have to register it as a COM server. The pywin32 package includes an example that shows how to do that, however it only works for Excel prior to 2007. For 2007 and later versions you will need this code https://github.com/pyxll/exceltypes to make that example work (see the modified example from pywin32 in exceltypes/demos in that repo).
You can't change an Excel file that's being used by another application because the file format does not support concurrent access.
I created a little script in python to generate an excel compatible xml file (saved with xls extension). The file is generated from a part database so I can place an order with the extracted data.
On the website for ordering the parts, you can import the excel file so the order fills automatically. The problem here is that each time I want to make an order, I have to open excel and save the file with xls extension of type MS Excel 97-2003 to get the import working.
The Excel document then looks exactly the same, but when it is opened with Notepad, the XML is no longer visible, only a binary dump.
Is there a way to automate this process, by running a bat file or maybe adding some line to my python script so it is converted in the proper format?
(I know that question has been asked before, but it never has been answered)
There are two basic approaches to this.
You asked about the first: Automating Excel to open and save the file. There are in fact two ways to do that. The second is to use Python tools that can create the file directly in Python without Excel's help. So:
1a: Automating Excel through its automation interface.
Excel is designed to be controlled by external apps, through COM automation. Python has a great COM-automation interface inside of pywin32. Unfortunately, the documentation on pywin32 is not that great, and all of the documentation on Excel's COM automation interface is written for JScript, VB, .NET, or raw COM in C. Fortunately, there are a number of questions on this site about using win32com to drive Excel, such as this one, so you can probably figure it out yourself. It would look something like this:
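import win32com.client
excel = win32com.client.Dispatch('Excel.Application')
spreadsheet = excel.Workbooks.Open('C:/path/to/spreadsheet.xml')
# 56 is the numeric value of xlExcel8 (Excel 97-2003 binary .xls); the named
# constant is not an attribute of the Application object.
spreadsheet.SaveAs('C:/path/to/spreadsheet.xls', FileFormat=56)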
That isn't tested in any way, because I don't have a Windows box with Excel handy. And I vaguely remember having problems getting access to the fileformat names from win32com and just punting and looking up the equivalent numbers (a quick google for "fileformat xlExcel8" shows that the numerical equivalent is 56, and confirms that's the right format for 97-2003 binary xls).
Of course if you don't need to do it in Python, MSDN is full of great examples in JScript, VBA, etc.
The documentation you need is all on MSDN (since the Office Developer Network for Excel was merged into MSDN, and then apparently became a 404 page). The top-level page for Excel is Welcome to the Excel 2013 developer reference (if you want a different version, click on "Office client development" in the navigation thingy above and pick a different version), and what you mostly care about is the Object model reference. You can also find the same documentation (often links to the exact same webpages) in Excel's built-in help. For example, that's where you find out that the Application object has a Workbooks property, which is a Workbooks object, which has Open and Add methods that return a Workbook object, which has a SaveAs method, which takes an optional FileFormat parameter of type XlFileFormat, which has a value xlExcel8 = 56.
As I implied earlier, you may not be able to access enumeration values like xlExcel8 for some reason which I no longer remember, but you can look the value up on MSDN (or just Google it) and put the number 56 instead.
The other documentation (both here and elsewhere within MSDN) is usually either stuff you can guess yourself, or stuff that isn't relevant from win32com. Unfortunately, the already-sparse win32com documentation expects you to have read that documentation—but fortunately, the examples are enough to muddle your way through almost everything but the object model.
1b: Automating Excel via its GUI.
Automating a GUI on Windows is a huge pain, but there are a number of tools that make it a whole lot easier, such as pywinauto. You may be able to just use swapy to write the pywinauto script for you.
If you don't need to do it in Python, separate scripting systems like AutoIt have an even larger user base and even more examples to make your life easier.
2: Doing it all in Python.
xlutils, part of python-excel, may be able to do what you want, without touching Excel at all.
Using python I need to be able to do the following operations to a workbook for excel 2007:
delete rows
sorting a worksheet
getting distinct values from a column
I am looking into openpyxl; however, it seems to have limited capabilities.
Can anyone please recommend a library that can do the above tasks?
I want to preface this by letting you know this is a Windows-only solution. But if you are using Windows I would recommend using Win32Com, which can be found here. This module gives Python programmatic access to any Microsoft Office application (including Excel) and uses many of the same methods used in VBA. Usually what you will do is record a macro (or recall from memory) how to do something in VBA and then use the same functions in Python.
To start, we want to connect to Excel and get access to the first sheet as an example.
#First we need to access the module that lets us connect to Excel
import win32com.client
# Next we want to create a variable that represents Excel
app = win32com.client.Dispatch("Excel.Application")
# Lastly we will assume that the workbook is active and get the first sheet
wbk = app.ActiveWorkbook
sheet = wbk.Sheets(1)
At this point we have a variable named sheet that represents the excel work sheet we will be working with. Of course there are multiple ways to access the sheet, this is usually the way I demo how to use win32com with excel because it is very intuitive.
Now assume I have the following values on the first sheet and I will go over one by one how to answer what you were asking:
A
1 "d"
2 "c"
3 "b"
4 "a"
5 "c"
Delete Rows:
Let's assume that you want to delete the first row in your active sheet.
sheet.Rows(1).Delete()
This creates:
A
1 "c"
2 "b"
3 "a"
4 "c"
Next, let's sort the cells in ascending order (although I would recommend extracting the values to Python, doing the sorting within a list, and sending the values back).
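rang = sheet.Range("A1","A4")
# The Sort object needs at least one sort field before Apply() has any effect
sheet.Sort.SortFields.Clear()
sheet.Sort.SortFields.Add(rang)
sheet.Sort.SetRange(rang)
sheet.Sort.Apply()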
This creates:
A
1 "a"
2 "b"
3 "c"
4 "c"
And now we will get distinct values from the column. The main thing to take away here is how to extract values from cells. You can either select many cells at once with sheet.Range("A1","A4") or access the values one cell at a time with sheet.Cells(row,col). Range is orders of magnitude faster, but Cells is slightly easier for debugging.
#Get a list of all Values using Range
valLstRange = [val[0] for val in sheet.Range("A1","A4").Value]
#Get a list of all Values using Cells
valLstCells = [sheet.Cells(row,1).Value for row in range(1,5)]
#valLstCells and valLstRange both = ["a","b","c","c"]
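The snippet above collects the column values but stops short of removing duplicates. A minimal sketch of that last step, assuming the values arrive the way a multi-cell Range(...).Value does through win32com (a tuple of row tuples) and that first-seen order should be kept. The helper name distinct_column_values is made up for illustration.

```python
def distinct_column_values(range_value):
    """Return the distinct first-column values, in first-seen order.

    `range_value` is assumed to be what a multi-cell Range(...).Value
    returns through win32com: a tuple of row tuples.
    """
    first_column = [row[0] for row in range_value]
    return list(dict.fromkeys(first_column))  # dicts keep insertion order


print(distinct_column_values((("a",), ("b",), ("c",), ("c",))))
# prints ['a', 'b', 'c']
```

If the order does not matter, set(first_column) would do the same job; dict.fromkeys relies on dictionaries preserving insertion order, which holds from Python 3.7 on.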
Now lastly you wanted to save the workbook and you can do this with the following:
wbk.SaveAs("C:/savedWorkbook.xlsx")
And you are done!
INFO About COM
If you have worked with VBA, .NET, VBScript or any other language to work with Excel, many of these Excel methods will look the same. That is because they are all using the same library provided by Microsoft. This library uses COM, which is Microsoft's way of providing language-agnostic APIs to programmers. COM itself is an older technology and can be tricky to debug. If you want more information on Python and COM I highly recommend Python Programming on Win32 by Mark Hammond. He is the guy that gets a shoutout after you install Python on Windows in the official .msi installer.
ALTERNATIVES TO WIN32COM
I also need to point out there are several fantastic open source alternatives that can be faster than COM in most situations and work on any OS (Mac, Linux, Windows, etc.). These tools all parse the zipped files that comprise a .xlsx. If you did not know that a .xlsx file is a .zip, just change the extension to .zip and you can then explore the contents (kind of interesting to do at least once in your career). Of these I recommend openpyxl, which I have used for parsing and creating Excel files on a server where performance was critical. Never use win32com for server activities, as it opens an out-of-process instance of excel.exe for each instance, and that can be leaky.
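To peek inside a workbook without renaming it, something like the following should work, since Python's standard zipfile module does not care about the file extension. The helper name list_xlsx_parts is made up for this sketch.

```python
import zipfile


def list_xlsx_parts(xlsx):
    """Return the names of the files packed inside an .xlsx workbook.

    `xlsx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(xlsx) as archive:
        return archive.namelist()
```

Calling list_xlsx_parts on a real workbook would typically show entries such as [Content_Types].xml and a number of XML parts under xl/.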
RECOMMENDATION
I would recommend win32com for users who are working intimately with individual data sets (analysts, financial services, researchers, accountants, business operations, etc.) that are performing data discovery activities as it works great with open workbooks. However, developers or users that need to perform very large tasks with a small footprint or extremely large manipulations or processing in parallel must use a package such as openpyxl.
I'm trying to use python-gdata to populate a worksheet in a spreadsheet. The problem is, updating individual cells is woefully slow. (By doing them one at a time, each request takes about 500ms!) Thus, I'm attempting to use the batch mechanism built into gdata to speed things up.
The problem is, I can't seem to insert new cells. I've scoured the web for examples, but I couldn't find any. This is my code, which I've adapted from an example in the documentation. (The documentation does not actually say how to insert cells, but it does show how to update cells. Since this is a new worksheet, it has no cells.)
Furthermore, with debugging enabled I can see that my request returns HTTP 200 OK.
import time
import gdata.spreadsheet
import gdata.spreadsheet.service
import gdata.spreadsheets.data
email = '<snip>'
password = '<snip>'
spreadsheet_key = '<snip>'
worksheet_id = 'od6'
spr_client = gdata.spreadsheet.service.SpreadsheetsService()
spr_client.email = email
spr_client.password = password
spr_client.source = 'Example Spreadsheet Writing Application'
spr_client.ProgrammaticLogin()
# create a cells feed and batch request
cells = spr_client.GetCellsFeed(spreadsheet_key, worksheet_id)
batchRequest = gdata.spreadsheet.SpreadsheetsCellsFeed()
# create a cell entry
cell_entry = gdata.spreadsheet.SpreadsheetsCell()
cell_entry.cell = gdata.spreadsheet.Cell(inputValue="foo", text="bar", row='1', col='1')
# add the cell entry to the batch request
batchRequest.AddInsert(cell_entry)
# submit the batch request
updated = spr_client.ExecuteBatch(batchRequest, cells.GetBatchLink().href)
My hunch is that I'm simply misunderstanding the API, and that this should work with changes. Any help is much appreciated.
I recently ran across this as well (when trying to delete) but per the docs here it doesn't appear that batch insert or delete operations are supported:
A number of batch operations can be combined into a single request.
The two types of batch operations supported are query and update.
insert and delete are not supported because the cells feed cannot be
used to insert or delete cells. Remember that the worksheets feed must
be used to do that.
I'm not sure of your use case, but would using the ListFeed help at all? It still won't let you batch operations, so there will be the associated latency, but it may be more tolerable than what you're dealing with now (or were at the time).
As of Google I/O 2016, the latest Google Sheets API supports batch cell updates (and reads). Be aware however, that GData is now deprecated, along with most GData-based APIs, including your sample above as the new API is not GData. Also putting email addresses and passwords in plain text in code is a security risk, so new(er) Google APIs use OAuth2 for authorization. You need to get the latest Google APIs Client Library for Python. It's as easy as pip install -U google-api-python-client [or pip3 for Python 3].
As far as batch insert goes, here's a simple code sample. Assume you have multiple rows of data in rows. To mass-inject this into a Sheet, say with file ID SHEET_ID & starting at the upper-left in cell A1, you'd make one call like this:
SHEETS.spreadsheets().values().update(spreadsheetId=SHEET_ID, range='A1',
        body={'values': rows}, valueInputOption='RAW').execute()
If you want a longer example, see the first video below where those rows are read out of a relational database. For those new to this API, here's one code sample from the official docs to help get you kickstarted. For slightly longer, more "real-world" examples, see these videos & blog posts:
Migrating SQL data to a Sheet plus code deep dive post
Formatting text using the Sheets API plus code deep dive post
Generating slides from spreadsheet data plus code deep dive post
The latest Sheets API provides features not available in older releases, namely giving developers programmatic document-oriented access to a Sheet as if you were using the user interface (create frozen rows, perform cell formatting, resizing rows/columns, adding pivot tables, creating charts, etc.)
However, to perform file-level access on Sheets, such as import/export, copy, move, rename, etc., you'd use the Google Drive API. Examples of using the Drive API:
Exporting a Google Sheet as CSV (blogpost)
"Poor man's plain text to PDF" converter (blogpost) (*)
(*) - TL;DR: upload plain text file to Drive, import/convert to Google Docs format, then export that Doc as PDF. Post above uses Drive API v2; this follow-up post describes migrating it to Drive API v3, and here's a developer video combining both "poor man's converter" posts.