I would like to create a script that opens Visio files (.vsd), saves each one as .vsdx, .pdf and .svg (with every page of the .vsd as a separate file), closes the file, and then opens the next one until no files remain.
So far I have been successful with saving to .pdf using: Python Visio to pdf
import win32com.client
#change later to dynamic current path
path= r"C:/automation_visio/"
visio = win32com.client.Dispatch("Visio.Application")
doc = visio.Documents.Open(path+'test.vsd')
doc.ExportAsFixedFormat( 1, path+'test.pdf', 1, 0 ) #exports as pdf only XD
I looked in lots of places (most relevant: https://learn.microsoft.com/en-us/office/vba/api/visio.document.saveas) but to no avail - I don't know how to save to other filetypes which are available with manual "SaveAs".
EDIT: I also need to know how to navigate through the pages (to get the list of pages, iterate through them and save each to an svg file) and (shamefully) how to correctly close the file after everything has been exported.
You need to use the page.Export method instead of ExportAsFixedFormat. Just give the target file the .svg extension, and you are good to go.
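A minimal sketch of how the whole loop from the question might look with page.Export, assuming pywin32 and a Visio version recent enough for SaveAs to accept a .vsdx target (2013 or later). The names safe_name and convert_folder are hypothetical helpers, not part of any library, and the win32com import is deferred so that the file-name helper can be used on a machine without Visio.

```python
import re
from pathlib import Path


def safe_name(page_name):
    """Replace characters that Windows does not allow in file names."""
    return re.sub(r'[<>:"/\\|?*]', "_", page_name).strip() or "page"


def convert_folder(folder):
    """Export every .vsd in `folder` to .vsdx, .pdf and one .svg per page."""
    import win32com.client  # requires Windows, Visio and pywin32

    visio = win32com.client.Dispatch("Visio.Application")
    try:
        for vsd in sorted(Path(folder).resolve().glob("*.vsd")):
            doc = visio.Documents.Open(str(vsd))
            try:
                # Same PDF call as in the question.
                doc.ExportAsFixedFormat(1, str(vsd.with_suffix(".pdf")), 1, 0)
                # Page.Export picks the output format from the extension.
                for index, page in enumerate(doc.Pages, start=1):
                    name = f"{vsd.stem}_{index}_{safe_name(page.Name)}.svg"
                    page.Export(str(vsd.with_name(name)))
                # Assumption: Visio chooses .vsdx from the extension.
                doc.SaveAs(str(vsd.with_suffix(".vsdx")))
            finally:
                doc.Close()
    finally:
        visio.Quit()
```

Calling convert_folder(r"C:\automation_visio") would then process the folder from the question. The try/finally blocks are there so that the document and the Visio instance are closed even if one export fails.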
BTW, I have a Visio Add-in (check the profile) that adds some useful stuff to the export, like connections, properties, etc, to be used from JavaScript. And it is also callable programmatically.
I wanted to add that there are two options for exporting SVG from Visio. Normally, Visio adds a bunch of extra data, such as User-defined cells, Layers, and Shape Data fields. This can be useful if you want to program against the export, or re-import to Visio at some time in the future.
However, if you want small, clean SVG, you won't want all of that extra stuff. So you can fiddle with Visio.Application.ApplicationSettings.SVGExportFormat, setting it equal to one of the following:
// (0) Include both SVG elements and Visio elements. This is the default.
Visio.VisSVGExportFormat.visSVGIncludeVisioElements
// (1) Include SVG elements only.
Visio.VisSVGExportFormat.visSVGExcludeVisioElements
The extra Visio info that is added to the SVG export is easy to find, just look for elements with the "v:" prefix.
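From Python, the same setting could be applied roughly as follows. This is a sketch that assumes `visio` is the object returned by win32com.client.Dispatch("Visio.Application"); the constant names and both helper functions are made up for illustration, and the counter simply looks for the "v:" prefix mentioned above.

```python
import re

# Values of Visio.VisSVGExportFormat, as listed above.
VIS_SVG_INCLUDE_VISIO_ELEMENTS = 0
VIS_SVG_EXCLUDE_VISIO_ELEMENTS = 1


def use_clean_svg(visio):
    """Make later SVG exports omit the Visio-specific elements."""
    visio.ApplicationSettings.SVGExportFormat = VIS_SVG_EXCLUDE_VISIO_ELEMENTS


def count_visio_elements(svg_text):
    """Count opening tags that carry the "v:" prefix."""
    return len(re.findall(r"<v:[A-Za-z_][\w.-]*", svg_text))
```

Running count_visio_elements on an exported file's text is a quick way to check which of the two formats was actually used.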
Related
Good morning all,
I've made a Python script that adds text on top of images, based on a preset template. I'm now developing a template editor that will let the user edit the template in a GUI, then save the template as a config file. The idea is that one user can create a template, export it, and send it to a new user on a separate computer, who can import it into their config file. The second user will retain full edit abilities on the template (if any changes need to be made).
Now, in addition to the text, I also want the ability to add up to two images (company logos, etc.) to the template/stills. Now, my question: Is there a way to convert a JPG to pure text data that can be saved to a config file and reinterpreted as a JPG on the receiving system? And if not, what would be the best way to achieve this? What I'm hoping to avoid is the user having to send the image files separately.
It sounds questionable to ship an image as a text file. It is easy (base64 is supplied with Python), but it increases the number of bytes by about a third, so I'd strongly recommend not doing that.
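For reference, the base64 round trip and its size cost can be sketched like this. The function names are hypothetical and the byte string is only a stand-in for the contents of a real JPG file.

```python
import base64


def image_to_text(image_bytes):
    """Encode raw image bytes as ASCII text that fits in a config file."""
    return base64.b64encode(image_bytes).decode("ascii")


def text_to_image(text):
    """Turn the stored text back into the original bytes."""
    return base64.b64decode(text)


raw = bytes(range(256)) * 12  # stand-in for the bytes of a JPG file
encoded = image_to_text(raw)
assert text_to_image(encoded) == raw
print(len(raw), len(encoded))  # prints "3072 4096"
```

Every 3 input bytes become 4 output characters, which is where the overhead comes from.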
I'd rather take the text and embed it in the image metadata! That way, you would still have a valid image file, but if loaded with your application, that application could read the metadata, interpret it as text config.
There are EXIF and XMP metadata, and there are Python modules for both.
Alternatively, it would make more sense to simply put the images and config files into one archive file. (You know .docx Word documents? They do exactly that, just like .odt. Java JAR files? Same. Android APK files? All archive files with multiple files inside.) Python ships with the zipfile module, which lets you do that easily.
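A minimal sketch of the archive idea using only the standard library. The layout (a config.json entry plus an images/ folder) and the function names are assumptions chosen for illustration.

```python
import io
import json
import zipfile


def pack_template(config, images):
    """Return the bytes of a zip holding config.json plus the image files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("config.json", json.dumps(config, indent=2))
        for name, data in images.items():
            archive.writestr(f"images/{name}", data)
    return buffer.getvalue()


def unpack_template(blob):
    """Read the config dict and the image bytes back out of the archive."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        config = json.loads(archive.read("config.json"))
        images = {
            name[len("images/"):]: archive.read(name)
            for name in archive.namelist()
            if name.startswith("images/")
        }
    return config, images
```

The bytes returned by pack_template can be written to a single file with any extension you like, and the receiving user imports that one file.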
Instead of an archive, you could also build a PDF file. That way, you could simply have the images embedded in the PDF, the text editable on top of it, any browser can display it, and the text stays editable. Operating on pdf files can be done in many ways, but I like Fitz from the PyMuPDF package. Just make a document the size of your image, add the image file, put the text on top. On the reader side, find the image and text elements. It's relatively ok to do!
PDF is a very flexible format. If you need more config than just text information, you can add arbitrary text streams to the document that are not displayed.
If I understand properly, you want to use the config file as a settings file that stores a user's preferences. You could store such data as JSON, XML, YAML or similar; these formats hold the data as readable text rather than binary, and can be parsed into a Python dict object. As for the images, you could upload them to a server and store their URLs in the config, so that they can be re-downloaded when needed. Or did I misunderstand the question?
I am creating multiple powerpoint decks that have data modified but I need to be able to hit this "refresh slides" button that accesses the updated data from a connected website. Is there a way to do this automatically in python?
No, there is not. However, there is a workaround:
You have to put your data in individual .xlsx (Excel) files
Use this data within your PowerPoint presentation (see here for a description)
Write a script to refresh your data and output it into the .xlsx files
If you have done everything correctly, the next time you open PowerPoint it will ask you whether or not it should refresh the tables/graphs.
That is the only workaround I have found for refreshing PowerPoint presentations.
I will explain my dilemma first: I have several thousand PowerPoint files (.ppt) that I need to extract the text from. The problem is that the text is disorganized in the file, and when read as a complete page it makes no sense for what I need (in the example it would read: line 1, line 3, line 2, line 4, line 5).
I was using tika to read the files initially. I then thought if I converted to pdf using glob and win32com.client that I would have some better luck but it's basically the same result. The picture here is an example of what the text is like.
So my idea now is that if I can section the pdf or ppt by pixel location (saving each section to a separate temp file if needed, then opening and reading it), I can keep things in order and get what I need. Although the text moves around within each box, the black outline boxes are always roughly in the same location.
I cannot find anything to split an individual pdf page though, only multiple pages into a single page. Does anyone have an idea how to go about doing this?
I need to read the text in box one together (line 1 and line 2) and load into a dictionary or some other container, and the same for the second box. For reference there is only one slide in the powerpoint.
Allow me to provide the answer as a general guideline:
.pptx files are glorified .zip files; legacy .ppt files use a binary format, so they have to be converted first.
Convert the .ppt files into .pptx files.
Use 7-zip or WinZip to open a .pptx and understand the structure.
Each slide should now have a .xml file full of tags you can parse.
For example you will find tags for each text box with tags for that box's text nested inside.
Also: python-pptx
Mass convert by tweaking this VBA code: Link for VBA
Or using PowerShell: Link for [PowerShell]
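The parsing step above can be sketched with the standard library alone. This assumes the file has already been converted to .pptx and that the slide of interest is ppt/slides/slide1.xml; the function name text_boxes is hypothetical. Shapes that inherit their position from the slide layout carry no offset of their own and are reported at (0, 0) here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
}


def text_boxes(pptx_file, slide="ppt/slides/slide1.xml"):
    """Return (x, y, text) for each shape on one slide, top to bottom."""
    with zipfile.ZipFile(pptx_file) as archive:
        root = ET.fromstring(archive.read(slide))
    boxes = []
    for shape in root.iter(f"{{{NS['p']}}}sp"):
        # Position of the shape in EMU, if the shape defines one.
        offset = shape.find("p:spPr/a:xfrm/a:off", NS)
        x = int(offset.get("x")) if offset is not None else 0
        y = int(offset.get("y")) if offset is not None else 0
        # One string per paragraph, joining the runs inside it.
        paragraphs = [
            "".join(run.text or "" for run in para.iter(f"{{{NS['a']}}}t"))
            for para in shape.iterfind("p:txBody/a:p", NS)
        ]
        boxes.append((x, y, "\n".join(paragraphs)))
    return sorted(boxes, key=lambda box: (box[1], box[0]))
```

Because the text is grouped per shape and sorted by position, the lines inside one outlined box stay together regardless of the order in which they are stored in the file.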
I'm using pyPdf to merge several PDF files into one. This works great, but I would also need to add a table of contents/outlines/bookmarks to the PDF file that is generated.
pyPdf seems to have only read support for outlines. Reportlab would allow me to create them, but the opensource version does not support loading PDF files, so that doesn't work to add outlines to an existing file.
Is there any way I can add outlines to an existing PDF using Python, or any library that would allow that?
https://github.com/yutayamamoto/pdfoutline
I made a python library just for adding an outline to an existing PDF file.
It looks like pypdf can do the job. See the add_outline_item method in the documentation.
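A sketch of merging with pypdf and adding one outline entry per input file, assuming a recent pypdf release (where the classes are PdfReader and PdfWriter). The helper names are hypothetical, and the file path is used as the bookmark title only for illustration.

```python
def outline_entries(titles, page_counts):
    """Pair each title with the 0-based page where its file starts."""
    entries, start = [], 0
    for title, count in zip(titles, page_counts):
        entries.append((title, start))
        start += count
    return entries


def merge_with_outline(paths, output_path):
    """Merge PDFs and add one top-level outline item per input file."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    readers = [PdfReader(path) for path in paths]
    writer = PdfWriter()
    for reader in readers:
        for page in reader.pages:
            writer.add_page(page)
    counts = [len(reader.pages) for reader in readers]
    for title, page_number in outline_entries(paths, counts):
        writer.add_outline_item(title, page_number)
    with open(output_path, "wb") as handle:
        writer.write(handle)
```

The outline items are added after all pages so that every target page number already exists in the writer.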
We had a similar problem in WeasyPrint: cairo produces the PDF files but does not support bookmarks/outlines or hyperlinks. In the end we bit the bullet, read the PDF spec, and did it ourselves.
WeasyPrint’s pdf.py has a simple PDF parser and writer that can add or override PDF "objects" in an existing document. It uses the PDF "update" mechanism and only appends at the end of the file.
This module was made for internal use only but I’m open to refactoring it to make it easier to use in other projects.
However the parser takes a few shortcuts and can not parse all valid PDF files. It may need to be adapted if PyPDF’s output is not as nice as cairo’s. From the module’s docstring:
Rather than trying to parse any valid PDF, we make some assumptions
that hold for cairo in order to simplify the code:
All newlines are '\n', not '\r' or '\r\n'
Except for number 0 (which is always free) there is no "free" object.
Most white space separators are made of a single 0x20 space.
Indirect dictionary objects do not contain '>>' at the start of a line except to mark the end of the object, followed by 'endobj'. (In other words, '>>' markers for sub-dictionaries are indented.)
The Page Tree is flat: all kids of the root page node are page objects, not page tree nodes.
I have a WSGI application that generates invoices and stores them as PDF.
So far I have solved similar problems with FPDF (or equivalents), generating the PDF from scratch like a GUI. Sadly this means the entire formatting logic (positioning headers, footers and content, styling) is in the application, where it really shouldn't be.
As the templates already exist in Office formats (ODT, DOC, DOCX), I would prefer to simply use those as a basis and fill in the actual content. I've found the Appy framework, which does pretty much that with annotated ODT files.
That still leaves the bigger problem open, though: converting ODT (or DOC, or DOCX) to PDF. On a server. Running Linux. Without GUI libraries. And thus, without OO.o or MS Office.
Is this at all possible or am I better off keeping the styling in my code?
The actual content that would be filled in is actually quite restricted: a few paragraphs, some of which may be optional, a headline or two, always at the same place, and a few rows of a table. In HTML this would be trivial.
EDIT: Basically, I want a library that can generate ODT files from ODF files acting as templates and a library that can convert the result into PDF (which is probably the crux).
I don't know how to go about automatic ODT -> PDF conversion, but a simpler route might be to generate your invoices as HTML and convert them to PDF using http://www.xhtml2pdf.com/. I haven't tried the library myself, but it definitely seems promising.
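A rough sketch of that HTML route. The template, the field names and both function names are invented for illustration; the conversion step assumes the xhtml2pdf package and its pisa.CreatePDF entry point, and is kept in its own function so that the HTML rendering can be used without it.

```python
from html import escape
from string import Template

INVOICE = Template("""<html><body>
<h1>Invoice $number</h1>
<p>$customer</p>
<table>$rows</table>
</body></html>""")


def render_invoice(number, customer, items):
    """Fill the HTML template; `items` is a list of (name, amount) pairs."""
    rows = "".join(
        f"<tr><td>{escape(name)}</td><td>{amount:.2f}</td></tr>"
        for name, amount in items
    )
    return INVOICE.substitute(
        number=escape(number), customer=escape(customer), rows=rows
    )


def write_pdf(html, output_path):
    """Convert the HTML to PDF; returns True if no error was reported."""
    from xhtml2pdf import pisa  # third-party: pip install xhtml2pdf

    with open(output_path, "wb") as handle:
        status = pisa.CreatePDF(html, dest=handle)
    return not status.err
```

This keeps the layout in an HTML template rather than in application code, which is the separation the question asks for.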
You can use QTextDocument, QTextCursor and QTextDocumentWriter in PyQt4. A simple example to show how to write to an odt file:
>>> from PyQt4 import QtGui
# Create a document object
>>> doc = QtGui.QTextDocument()
# Create a cursor pointing to the beginning of the document
>>> cursor = QtGui.QTextCursor(doc)
# Insert some text
>>> cursor.insertText('Hello world')
# Create a writer to save the document
>>> writer = QtGui.QTextDocumentWriter()
>>> writer.supportedDocumentFormats()
[PyQt4.QtCore.QByteArray(b'HTML'), PyQt4.QtCore.QByteArray(b'ODF'), PyQt4.QtCore.QByteArray(b'plaintext')]
>>> odf_format = writer.supportedDocumentFormats()[1]
>>> writer.setFormat(odf_format)
>>> writer.setFileName('hello_world.odt')
>>> writer.write(doc) # Return True if successful
True
I'm not sure of the difference between odt and odf in this case. I checked the file type and it said 'application/vnd.oasis.opendocument.text', so I assume it is odt. You can print to a pdf file by using QPrinter.
More information at:
http://qt-project.org/doc/qt-4.8/