There's great stuff out there for handling Excel files from Python, and I think I'm just falling into a funny little crack: I need to write out a multi-worksheet workbook in the Excel 2003 XML format using pure Python (not win32com or VBA or something). Just like the poster here, I'm taking nasty proprietary files and having to spit them out in precisely the same nasty proprietary way, or else the nasty proprietary software won't take them back. I'm manipulating the data along the way, so this isn't just a format conversion; I need to be in Python to do real work on the files, and then write them back out in the same format they came in. A simpler version of the question was asked here but not directly answered.
The xlsxwriter docs have a nice summary of the current state of the art, which agrees with my own Googling: xlwt will handle the old non-XML formats, openpyxl specifically does Excel 2010 formats, xlsxwriter itself is for 2007+, pythonOffice hasn't been touched since 2012.
Please tell me I don't have to parse everything manually with BeautifulSoup or something to get back to Excel 2003! I can use Python 2, or 3, or both, if needed. Thanks. These are the relevant bits of namespace:
<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
<DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">
...
</DocumentProperties>
<ExcelWorkbook xmlns="urn:schemas-microsoft-com:office:excel">
I've also been dealing with similarly annoying proprietary files. After digging through all of the same Python Excel extensions, I've also come to the conclusion that yes, you will have to parse (and write) the XML file manually.
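For the writing half, doing it by hand is less painful than it sounds. Here is a minimal sketch that uses only the standard library and emits a multi-worksheet workbook with the same namespace header shown in the question. The function names (cell_xml, workbook_xml) are made up for illustration, and it assumes your target software only needs plain Worksheet/Table/Row/Cell/Data elements; styles, DocumentProperties and ExcelWorkbook blocks would have to be added the same way if the receiving program insists on them.

```python
from xml.sax.saxutils import escape, quoteattr

# Header copied from the namespace declarations in the question.
HEADER = (
    '<?xml version="1.0"?>\n'
    '<?mso-application progid="Excel.Sheet"?>\n'
    '<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"\n'
    ' xmlns:o="urn:schemas-microsoft-com:office:office"\n'
    ' xmlns:x="urn:schemas-microsoft-com:office:excel"\n'
    ' xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"\n'
    ' xmlns:html="http://www.w3.org/TR/REC-html40">\n'
)


def cell_xml(value):
    """Return one <Cell> element (hypothetical helper name)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        kind, text = "Boolean", "1" if value else "0"
    elif isinstance(value, (int, float)):
        kind, text = "Number", repr(value)
    else:
        kind, text = "String", escape(str(value))
    return '<Cell><Data ss:Type="%s">%s</Data></Cell>' % (kind, text)


def workbook_xml(sheets):
    """Build the workbook text from a list of (sheet_name, rows) pairs."""
    parts = [HEADER]
    for name, rows in sheets:
        parts.append(' <Worksheet ss:Name=%s>\n  <Table>\n' % quoteattr(name))
        for row in rows:
            cells = "".join(cell_xml(value) for value in row)
            parts.append("   <Row>" + cells + "</Row>\n")
        parts.append("  </Table>\n </Worksheet>\n")
    parts.append("</Workbook>\n")
    return "".join(parts)
```

To save it, write the returned string to a file opened with encoding="utf-8". Building the text directly (rather than going through ElementTree) keeps the default namespace and the ss: prefix exactly as they appear in the original files, which matters when the consuming software is picky.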
I had the same problem, but found an answer that I think is both:
Faster to implement
Easier to modify and to evolve over time (especially if your CSV changes, for example when new columns are added)
I had to convert a regular CSV to .xml in SpreadsheetML format (XML Spreadsheet 2003), and I found a nice tutorial showing several ways to do it. For Python 3, I chose ffe (Flat File Extractor).
The essential:
To use ffe, you must be running in a Linux environment and install it, for example with $sudo apt-get install ffe. (FYI: there is also a binary for Windows.)
You need to create a .fferc configuration file in a specific format that acts like an XML template (see the article or documentation links provided)
You can then convert your input csv file into an output xml file using a bash/shell command line as such: $ffe -o output.xml -c csv2xml.fferc input.csv
If you want to prototype quickly, I succeeded in doing this in a Google Colab notebook. You can install ffe there using the sudo command I provided above.
Happy coding!
Link to full article: Converting CSV to XML on the Ubuntu Community Wiki.
I want to read VFP vcx files as plaintext in python. Any tips on how I should go about it?
I understand that the mime type of the file is an octet stream, which is typically associated with binary files. Also apparent is that VFP uses vcx file in combination with vct files to display the initial Source code.
I have been trying some static code analysis methods to extract the information that I need from the vct file, but I had no luck since the control characters mess up even the legible parts of the vct file, which is very hard to automate.
I have searched for weeks. This is my last resort before going into VFP and scraping it manually.
Any help is much appreciated.
You have a few options:
Fernando Bozzo's FoxBin2Prg is a GitHub repository with some VFP code to convert vcx, scx ... to prg files.
In the VFP Tools menu there is a View Class Code option.
There is scctext that ships with VFP.
All the above generate VFP prg files which are text. But probably that is not what you meant. Then you could simply open a vcx as a table (it is a table with a vcx extension) and read all the object names, properties, methods and such.
I want to read VFP vcx files as plaintext in python
For what it's worth, *.vcx / *.vct files are just renamed dBase/xBase *.DBF / *.FPT file pairs, just like *.scx / *.sct VFP Form files. So you could probably use something like dbfread.
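A rough sketch of that idea, with some assumptions stated up front: as far as I know dbfread finds the memo file by looking for a .fpt (or .dbt) file next to the table, so the sketch first copies the vcx/vct pair to a temporary .dbf/.fpt pair. The helper names (stage_as_dbf, dump_class_library) are invented for this example, and I have not checked it against every VFP version, so treat it as a starting point.

```python
import shutil
import tempfile
from pathlib import Path


def stage_as_dbf(vcx_path, workdir):
    """Copy lib.vcx / lib.vct to workdir as lib.dbf / lib.fpt; return the .dbf path."""
    vcx = Path(vcx_path)
    workdir = Path(workdir)
    dbf = workdir / (vcx.stem + ".dbf")
    shutil.copyfile(vcx, dbf)
    memo = vcx.with_suffix(".vct")
    if memo.exists():
        shutil.copyfile(memo, workdir / (vcx.stem + ".fpt"))
    return dbf


def dump_class_library(vcx_path):
    """Print every non-empty text field of every record in the class library."""
    # Third-party dependency (pip install dbfread), imported lazily on purpose.
    from dbfread import DBF

    with tempfile.TemporaryDirectory() as tmp:
        dbf = stage_as_dbf(vcx_path, tmp)
        # Assumption: replacing undecodable characters is acceptable here.
        for record in DBF(str(dbf), char_decode_errors="replace"):
            for field, value in record.items():
                if isinstance(value, str) and value.strip():
                    print("%s: %s" % (field, value))
```

Calling dump_class_library("mylib.vcx") (a hypothetical file name) should then list the properties and method source stored in the memo fields as ordinary text.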
I'm looking for a way to convert excel to html while preserving formatting.
I know this is doable on Windows thanks to some underlying win32 libraries (e.g. via xlwings; see "Python - Excel to HTML (keeping format)").
But I'm looking for a solution on Linux.
I've also come across Aspose Cells, but this requires a paid license or else it adds a lot of extra junk to the output that needs to be scrubbed out.
And lastly I tried the python lib xlsx2html but it does a very poor job at preserving formatting.
Are there any suggestions for a Linux based solution? I'd also be interested in tools written in other languages that can be easily wrapped around via python.
Thanks in advance!
Update:
Here is an example of a random excel sheet I converted via excel itself that I would like to reproduce. It has some colors, some border variations, some merged cells and some font sizes to see if they all work.
You can use LibreOffice to convert an Excel file to a HTML file using the command line:
# --convert-to implies --headless so it's not mandatory to specify --headless
soffice --headless --convert-to html data.xlsx
You can refer to the documentation to know more about other CLI parameters.
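Since the question asks for something that can be wrapped from Python, here is a small sketch that calls the same command through subprocess. It assumes soffice is on the PATH; the function names are just illustrative, and --outdir is the LibreOffice option for choosing where the converted file is written.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, outdir, soffice="soffice"):
    """Return the argument list for a headless LibreOffice HTML conversion."""
    return [
        soffice,
        "--headless",
        "--convert-to", "html",
        "--outdir", str(outdir),
        str(source),
    ]


def convert_to_html(source, outdir="."):
    """Run the conversion and return the expected path of the HTML file."""
    subprocess.run(build_convert_command(source, outdir), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    return Path(outdir) / (Path(source).stem + ".html")
```

For example, convert_to_html("data.xlsx", "out") should leave out/data.html behind, raising CalledProcessError if soffice exits with an error.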
I think you should look for Excel-to-HTML tools in the JS world rather than Python (I am not saying it is impossible in Python, but it is more common in JS); I promise you will get better results.
In my opinion, finding a JS-based solution and writing a Python wrapper around it would be more helpful, because the JS community has had to struggle with importing and working with Excel files more than other communities.
Another idea is to change your approach: look at how you can embed an Excel file in an HTML page (for example in an iframe) with JS and then export it.
But again, I highly recommend checking JS libraries or GitHub repositories; some of them do care about formatting.
I have a reporting module in my Django app that gives the user the ability to see their reports on screen or to export them and have the export opened by Excel.
The export is a cheat. I take the exact same output as the screen version and save it to a file with an .xls extension and
response = HttpResponse(body, content_type='application/vnd.ms-excel')
and badda-boom, badda-bing I have an Excel file that is lightly formatted, i.e. it respects the css styling that I've applied.
The nice thing for the user is that the file auto-opens in Excel; there aren't any extra steps for them. (find the download, import a text file, etc.)
Unfortunately it looks like Excel 2016 has decided (I'm guessing) that that's a security issue and no longer opens the file.
I'm aware of various python -> Excel tools. openpyxl looks promising. But that's going to require me to touch each report.
So, what I'm looking for is something that would give me what I have now, take an html file and have Excel open it as a native file and recognize the existing formatting.
The behavior change has been noted by Microsoft, and there are workarounds for the user:
https://support.microsoft.com/en-us/kb/3181507
It sounds like they're working on a fix.
I was wondering whether it would be possible to write a python script that retrieves header information from an .exe file. I tried googling but didn't really find any results that were usable.
Thanks.
Sept
There is pefile : multi-platform Python module to read and work with Portable Executable (aka PE) files. Most of the information in the PE Header is accessible, as well as all the sections, section's information and data.
Looks like I'm almost 2 years and a dollar short! If you still need to solve this, Michał Niklas was right on point above. pefile was written for this very purpose. Here is an example from my interactive session:
ipython
import pefile
pe = pefile.PE('file.exe')
pe.print_info()
The output is too verbose to put up here, but the above gives you all header information from a PE.
Download pefile here: pefile
Of course it is possible to write a Python script to retrieve header information from an XYZ file. Three simple steps:
(1) Find docs for the header part of an XYZ file; read them.
(2) Read the docs for the Python struct module or ctypes module or both.
(3) Write and test the script.
Which step are you having trouble with?
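To make step (3) concrete for the .exe case, here is a sketch using only the struct module. It relies on the documented PE layout: the file starts with "MZ", the 4-byte little-endian offset of the PE signature sits at 0x3C, and the 20-byte COFF file header follows the "PE\0\0" signature. The function name is made up for this example, and it only reads the COFF header, not the optional header or sections.

```python
import struct

# COFF file header: two 16-bit fields, three 32-bit fields, two 16-bit fields.
COFF_FORMAT = "<HHIIIHH"
COFF_FIELDS = (
    "Machine",
    "NumberOfSections",
    "TimeDateStamp",
    "PointerToSymbolTable",
    "NumberOfSymbols",
    "SizeOfOptionalHeader",
    "Characteristics",
)


def read_coff_header(data):
    """Parse the COFF header from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew: offset of the PE signature, stored at 0x3C in the DOS header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    values = struct.unpack_from(COFF_FORMAT, data, pe_offset + 4)
    return dict(zip(COFF_FIELDS, values))
```

Usage would be something like read_coff_header(open("file.exe", "rb").read()); a Machine value of 0x8664 means x86-64 and 0x14C means 32-bit x86.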
I'm writing a program that requires input in the form of a document, it needs to replace a few values, insert a table, and convert it to PDF. It's written in Python + Qt (PyQt). Is there any well known document standard which can be easily used programmatically? It must be cross platform, and preferably open.
I have looked into Microsoft Doc and Docx, which are binary formats and I can't edit them. Python has bindings for it, but they're only on Windows.
Open Office's ODT/ODF is XML inside a zip archive, so I can edit that one, but there are no command line utilities or any other way to programmatically convert the file to a PDF. Open Office provides bindings, but you need to run Open Office from the command line, start a server, etc. And my clients may not have Open Office installed.
RTF is readable from Python, but I couldn't find any way/libraries to convert RTF documents to PDF.
At the moment I'm exporting from Microsoft Word to HTML, replacing the values and using PyQt to convert it to a PDF. However it loses formatting features and looks awful. I'm surprised there isn't a well known library which lets you edit a variety of document formats and convert them into other formats, am I missing something?
Update: Thanks for the advice, I'll have a look at using Latex.
Thanks,
Jackson
Have you looked into using LaTeX documents?
They are perfect to use programmatically (compiling documents? You gotta love that...), and you have several Python frameworks you can use, such as plasTeX and PyTex.
Exporting a LaTeX documents to PDF is almost immediate.
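As an illustration of the "replace a few values, insert a table, convert to PDF" workflow from the question, here is a minimal sketch that needs no framework at all. The @@NAME@@ placeholder convention, the template text and all function names are my own assumptions, and compile_pdf assumes a pdflatex binary is installed and on the PATH.

```python
import subprocess
from pathlib import Path

# Hypothetical template; @@...@@ markers are an arbitrary placeholder convention.
TEMPLATE = r"""\documentclass{article}
\begin{document}
Report for @@CUSTOMER@@

@@TABLE@@
\end{document}
"""

_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def latex_escape(text):
    """Escape characters that have a special meaning in LaTeX."""
    return "".join(_SPECIALS.get(ch, ch) for ch in str(text))


def latex_table(rows):
    """Render a list of rows as a left-aligned tabular environment."""
    lines = [r"\begin{tabular}{" + "l" * len(rows[0]) + "}"]
    for row in rows:
        lines.append(" & ".join(latex_escape(cell) for cell in row) + r" \\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


def render(template, values, rows):
    """Fill the @@NAME@@ placeholders and the @@TABLE@@ slot."""
    out = template
    for name, value in values.items():
        out = out.replace("@@%s@@" % name, latex_escape(value))
    return out.replace("@@TABLE@@", latex_table(rows))


def compile_pdf(tex_source, workdir):
    """Write document.tex into workdir and run pdflatex on it."""
    workdir = Path(workdir)
    (workdir / "document.tex").write_text(tex_source, encoding="utf-8")
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", "document.tex"],
        cwd=str(workdir), check=True,
    )
    return workdir / "document.pdf"
```

The catch for the original poster is that clients would still need a TeX distribution installed, so this trades the Open Office dependency for a LaTeX one.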
Since you're already using PyQt anyway, it might be worth looking at Qt's built-in rich text processing module, which looks decent. Here's the documentation on detailed content manipulation, including inserting tables. Also, the QPrinter module's default print-to-file format happens to be PDF.
Without knowing more about your particular needs it's hard to say if these would do what you want, but since your application already has PyQt as a dependency, seems silly to introduce any more without evaluating the functionality you've already got available.
The non-GUI parts of the Qt framework are often overlooked though.
You might want to try ReportLab. The open source version can write PDFs, and the commercial version has a lot of really nice abstractions to allow output to a variety of different formats from a single input.
I don't know what kind of audience your program has; TeX is good and I would go with it.
Another possible choice is the Excel format, parsing it with xlrd.
I've used it a couple of times and it's pretty straightforward.
An Excel file is a good choice for the following reasons:
Well-known format that is easy to edit
You could prepare a predefined template with constraints and a table
Another option is creating XML documents, transforming them to XSL-FO and rendering them with FOP or RenderX. If you use DocBook as the primary input, there are toolchains freely available for converting that to PDF, RTF, HTML and so forth.
It is rather quirky to use and not my idea of fun, but it does deliver and can be embedded in an application, AFAICT.
Creating docbook is very straightforward as it has a wide range of semantic tags, table support etc to give a "meaningful" markup which can be reliably formatted. The XSL stylesheets are modular and allow parts to be customized or replaced to generate your own look and feel.
It works well for relatively free flow documents with lots of text.
For fill-in-the-blanks kinds of documents, a regular reporting engine may be a better fit, or some straightforward XSL stylesheets spitting out the XSL-FO directly.