On the topic of Python PPTX Internal Hyperlink
Is there a way to create Hyperlinks prior to creating the slides that they will be linked to?
E.g. create a table of contents slide with hyperlinks to slides that will be added later. The number of slides won't always be the same, as the presentation will be edited by the user.
Interesting question. Here's what I found working with a currently selected shape on a slide. I expect you'd be applying hyperlinks to a textrange instead, but the general idea's the same.
A hyperlink to another slide contains no .Address and the .SubAddress looks like:
SlideID,SlideIndex,SlideTitle
SlideTitle can be blank, so I tested with this to link to a third slide that wasn't there:
Activewindow.Selection.ShapeRange(1).ActionSettings(1).Hyperlink.subaddress = ",3,"
No errors, but the link doesn't work either.
This DOES work, however:
Activewindow.Selection.ShapeRange(1).ActionSettings(1).Hyperlink.subaddress = "258,3,"
I run this on a selected shape on the first slide then later add a third slide and the link jumps to it.
The trick then becomes: how will you know what the SlideID will be for a slide you haven't added yet? In a new presentation, PPT gives the first slide an ID of 256 and increments the ID for each new slide you add. But it'll be quite tricky to keep track of SlideIDs and SlideIndexes if you're adding new slides at random places in a presentation that may have had slides added and deleted beforehand.
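As a rough illustration of that bookkeeping, here is a minimal Python sketch that builds SubAddress strings in the SlideID,SlideIndex,SlideTitle form described above. The helper names (predicted_slide_id, sub_address) are made up for this example, and the prediction only holds under the stated assumption of a brand-new presentation where slides are appended in order and never deleted.

```python
# ID PowerPoint assigns to the first slide of a new presentation (per the text above).
FIRST_SLIDE_ID = 256


def predicted_slide_id(slide_index):
    """Guess the SlideID of the slide at 1-based slide_index.

    Assumption: a fresh presentation, slides only appended, none deleted.
    """
    if slide_index < 1:
        raise ValueError("slide_index is 1-based")
    return FIRST_SLIDE_ID + slide_index - 1


def sub_address(slide_id, slide_index, title=""):
    """Build a 'SlideID,SlideIndex,SlideTitle' string; the title may be blank."""
    return f"{slide_id},{slide_index},{title}"


# SubAddress values for table-of-contents links to slides 2, 3 and 4.
toc_links = [sub_address(predicted_slide_id(i), i) for i in range(2, 5)]
print(toc_links)
```

The entry for slide 3 matches the "258,3," value that worked in the test above. As soon as slides are inserted in the middle or deleted, the prediction no longer applies.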
Personally, I think I'd add blank slides ahead of time, link to them, then add whatever content's necessary.
I doubt it. Certainly the straightforward way won't work, but it's hard to say never against human ingenuity.
Anyway, this mechanism is based on PowerPoint's hyperlink behavior: there are special "action" verbs that mean "jump internally" rather than "jump to web address". The verb in this case would be NAMED_SLIDE (although NEXT_SLIDE, LAST_SLIDE, etc. are also available). The "name" of the slide in the XML is its relationship id, basically a keyword like "rId7" for the mapping of one PPTX package part (e.g. slide) to another.
Since there isn't a slide yet, there can be no relationship. Having such a "dangling" relationship would very likely trigger a repair error, but I can't see it working out in any case. Creating a new slide will create a new relationship without regard to relationships that are already there, so best case is your "pre-creation" relationship just gets ignored.
I think you're going to need a different strategy.
Related
Currently I am working on a project that will scrape content from various similarly designed websites which contain dynamic content. My end goal is to then aggregate all this data into one application or report of sorts. I made some progress in terms of pulling the needed data from one page but my lack of experience and knowledge in this realm has left me thinking I went down the wrong path.
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
The above link is the perfect example of the type of page I will be pulling from.
In my initial attempt I was able to have the page scroll to the bottom, all the while collecting data from the various elements using the selector below, plus a manual scroll.
cards = driver.find_elements_by_css_selector("div[class^='product-card__Content']")
This allowed me to pull all the data points I needed on the fly, minus the overarching category, which happens to be a parent element. I can map the category manually in Excel, but would prefer to have it pulled alongside everything else.
This got me thinking that maybe I should have taken a top-down approach, rather than what I am seeing as a bottom-up approach. But no matter how hard I tried, based on advice from others, I could not get it working as intended, where I pull the category from the parent div, due to my lack of understanding.
Based on input from others I was able to pivot: using the code below, I got the category as well as the product name without any need to scroll the page at all. That went against every experience I have had with this project so far, and I am unclear how/why this is possible.
for product_group_name in driver.find_elements_by_css_selector("div[class^='products-grid__ProductGroupTitle']"):
    for product in driver.find_elements_by_xpath("//div[starts-with(@class,'products-grid__ProductGroup')][./div[starts-with(@class,'products-grid__ProductGroupTitle')][text()='" + product_group_name.text + "']]//div[starts-with(@class,'consumer-product-card__InViewContainer')]"):
        print(product_group_name.text, product.text)
The problem with this code, which is much quicker as it does not rely on scrolling, is that no matter how I approach it I am unable to pull the additional data points of brand and price. Obviously it is something in my approach, but outside of my knowledge level currently.
Any specific or general advice would be appreciated, as I would like to scale this into something a bit more robust as my knowledge builds. Eventually I would like to have this scanning multiple different URLs at set points in the day. That is a long way off, but I want to make sure I start on the right path if possible. Based on what I have provided, is the top-down approach better in this case? Bottom-up? Is this subjective?
I have noticed comments about pulling the entire source code of the page and working with that, would that be a valid approach and possibly better suited to my needs? Would it even be possible based on the dynamic nature of the page?
Thank you.
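To make the top-down idea from the question above concrete, here is a minimal sketch that works on a snapshot of the page source rather than on live Selenium elements. It walks each product group once, reads the group title, then reads the cards inside that group, so the category travels with every row. The sample markup is a simplified, hypothetical stand-in for the real page: the group, title and card class prefixes come from the question, while the name, brand and price classes are invented for illustration. Real page source is usually not well-formed XML, so an HTML parser would likely be needed in place of ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified snapshot of the rendered markup.
SAMPLE = """
<div>
  <div class="products-grid__ProductGroup-abc">
    <div class="products-grid__ProductGroupTitle-abc">Category A</div>
    <div class="consumer-product-card__InViewContainer-abc">
      <span class="name">Product 1</span>
      <span class="brand">Acme</span>
      <span class="price">$40.00</span>
    </div>
  </div>
  <div class="products-grid__ProductGroup-abc">
    <div class="products-grid__ProductGroupTitle-abc">Category B</div>
    <div class="consumer-product-card__InViewContainer-abc">
      <span class="name">Product 2</span>
      <span class="brand">Zenith</span>
      <span class="price">$25.00</span>
    </div>
  </div>
</div>
"""

GROUP = "products-grid__ProductGroup"
TITLE = "products-grid__ProductGroupTitle"
CARD = "consumer-product-card__InViewContainer"


def has_class_prefix(el, prefix):
    """True if the element's class attribute starts with prefix."""
    return el.get("class", "").startswith(prefix)


def field(card, name):
    """Text of the span with the given (hypothetical) class, or None."""
    el = card.find(f".//span[@class='{name}']")
    return el.text if el is not None else None


def scrape(root):
    """Return (category, name, brand, price) rows, one per product card."""
    rows = []
    for group in root.iter("div"):
        # The title class shares the group prefix, so exclude it explicitly.
        if not has_class_prefix(group, GROUP) or has_class_prefix(group, TITLE):
            continue
        title = next(
            (d.text for d in group.iter("div") if has_class_prefix(d, TITLE)), None
        )
        for card in group.iter("div"):
            if has_class_prefix(card, CARD):
                rows.append(
                    (title, field(card, "name"), field(card, "brand"), field(card, "price"))
                )
    return rows


rows = scrape(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

The same shape carries over to Selenium: find the group elements once, then search for cards and their fields relative to each group element instead of from the document root.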
After trying for 30+ hours to implement certain functionality with python-docx and docxtpl (and repeatedly failing), I decided to come here for advice.
My current project consists of different pictures (.png), formatted texts (i.e. bold, shadow, font, color and so forth), etc. These elements need to be arranged / fit into a neat template. First, I tried Pillow by creating a canvas and adding each of these elements. That solution is extremely prone to errors and doesn't support all the functionality I need as far as text is concerned. Next, I created a .docx template (arranging pictures, text including font, style, etc.) and filled in the values this way. That worked, except that it doesn't support more than one picture / media element per Word page!
For demonstration purposes I tried to sketch the workflow:
Now it should be obvious why I tried Word - an easy-to-go word editing program in which I was able to format everything to my wishes (though the Python API didn't work, hence it's useless) - for demonstration purposes, here is a snippet of pseudo code:
#PSEUDO CODE
from docxtpl import DocxTemplate
tpl = DocxTemplate('file.docx')
tpl.replace_media('dummy.png', 'pic1.png')
tpl.replace_media('dummy2.png', 'pic2.png')
tpl.save('out.docx')
Depending on the setup, it either replaces neither picture or replaces both with the same one. According to various StackOverflow questions and threads, replacing more than one picture isn't possible! Therefore the Word approach is rather useless.
Anyhow, I'm out of knowledge. Any suggestions on how to achieve such a workflow, i.e. having an easy editable layout in which I just need to parse certain values in and get a .docx, .png, .pdf, whatever..
Can I edit the Header & Footer of an existing Presentation using python-pptx? The values I want to set are as shown in the attached image. Thanks.
I asked this a long time ago, but I can't remember where and couldn't find it on SO. Scanny answered the question, so I'm relaying his answer here (probably poorly).
By default, python-pptx doesn't include footer or page number placeholders when listing slide placeholders. It's common practice to recommend inserting text boxes instead when these are needed, but that's not useful when dealing with multiple templates or layouts.
The first thing you'll need to add somewhere is a patch so that the placeholders are included:
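from pptx.slide import SlideLayout

def footer_patch(self):
    # Yield every layout placeholder, footer and slide number included.
    for ph in self.placeholders:
        yield ph

SlideLayout.iter_cloneable_placeholders = footer_patch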
You should then be able to grab the footer from the placeholders with simple means:
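footer_copy = "Hi, it's me, the footer"
# Excerpt from an if/elif chain inside a loop over the slide's shapes;
# insert_text is a helper from the original code that is not shown here.
elif "FOOTER" in str(shape.placeholder_format.type):
    footer = slide.placeholders[shape.placeholder_format.idx]
    footer_text_frame = footer.text_frame
    insert_text(footer_copy, footer_text_frame)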
The above is old code, and probably a poor example of how to do this, but I hope it gives a starting point. A similar approach should work for the other values you listed there. Some values, like the page number, may require additional XML editing, which you can read about in another post where Scanny was my savior.
Please note, if you're using placeholders for other tasks, adding the Footer placeholder to the list of placeholders may have unforeseen consequences.
I am using python-docx to convert a Word docx to a custom HTML equivalent. The document that I need to convert has images and tables, but I haven't been able to figure out how to access the images and the tables within a given run. Here is what I am thinking...
for para in doc.paragraphs:
    for run in para.runs:
        # How to tell if this run has images or tables?
...but I don't see anything on the Run that has info on the InlineShape or Table. Do I have to fall back to the XML directly or is there a better, cleaner way to iterate over everything in the document?
Thanks!
There are actually two problems to solve for what you're trying to do. The first is iterating over all the block-level elements in the document, in document order. The second is iterating over all the inline elements within each block element, in the order they appear.
python-docx doesn't yet have the features you would need to do this directly. However, for the first problem there is some example code here that will likely work for you:
https://github.com/python-openxml/python-docx/issues/40
There is no exact counterpart I know of to deal with inline items, but I expect you could get pretty far with paragraph.runs. All inline content will be within a paragraph. If you got most of the way there and were just hung up on getting pictures or something, you could go down to the lxml level and decode some of the XML to get what you needed. If you get that far along and are still keen, post a feature request on the GitHub issues list for something like "feature: Paragraph.iter_inline_items()" and I can probably provide you with some similar code to get you what you need.
This requirement comes up from time to time so we'll definitely want to add it at some point.
Note that block-level items (paragraphs and tables primarily) can appear recursively, and a general solution will need to account for that. In particular, a paragraph can (and in fact at least one always must) appear in a table cell. A table can also appear in a table cell. So theoretically it can get pretty deep. A recursive function/method is the right approach for getting to all of those.
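The recursion described above can be sketched with the standard library alone. Note the assumptions: this walks raw WordprocessingML with ElementTree rather than using any python-docx API, the function name iter_block_items and the tiny sample document are made up for this illustration, and only paragraphs and tables are handled.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def iter_block_items(container, depth=0):
    """Yield (depth, kind, text) for paragraphs and tables in document order.

    container is a body or table-cell element; tables are descended into
    cell by cell, so nested paragraphs and tables are reached recursively.
    """
    for child in container:
        if child.tag == f"{{{W}}}p":
            text = "".join(t.text or "" for t in child.iter(f"{{{W}}}t"))
            yield depth, "paragraph", text
        elif child.tag == f"{{{W}}}tbl":
            yield depth, "table", ""
            for cell in child.findall("w:tr/w:tc", NS):
                yield from iter_block_items(cell, depth + 1)


# Minimal hand-written sample: paragraph, table with one cell, paragraph.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:r><w:t>Intro</w:t></w:r></w:p>
<w:tbl><w:tr><w:tc><w:p><w:r><w:t>Cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
<w:p><w:r><w:t>Outro</w:t></w:r></w:p>
</w:body></w:document>"""

body = ET.fromstring(SAMPLE).find("w:body", NS)
items = list(iter_block_items(body))
for depth, kind, text in items:
    print("  " * depth + kind, text)
```

For a real file, the same XML lives in word/document.xml inside the .docx zip archive, so it could be read with zipfile before parsing.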
Assuming doc is of type Document, then what you want to do is have 3 separate iterations:
One for the paragraphs, as you have in your code
One for the tables, via doc.tables
One for the shapes, via doc.inline_shapes
The reason your code wasn't working is that paragraphs don't hold references to the tables and/or shapes within the document; those are stored on the Document object.
Here is the documentation for more info: python-docx
I just read through the documentation on python-docx.
They mention several times that added content is created at the end of the document, but I didn't notice any way to alter this functionality.
Does anyone know how to add a new page to a pre-existing document, but make it page 1?
Thanks!
The short answer is the library doesn't support that just yet, although those features are high on the backlog so will be among the next to be implemented.
To get it done in the meantime you'll need to go down to the XML level with a "workaround" function. If you want to add this use case on this issue on GitHub I'll put together some workaround code you can use.
https://github.com/python-openxml/python-docx/issues/27
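To give a feel for what an XML-level workaround could involve, here is a minimal standard-library sketch. It is not the workaround code offered above and does not use python-docx. It builds a new paragraph whose run ends in a page break and inserts it as the first child of the document body, so the existing content starts on the following page. The helper names (qn, prepend_page) and the sample document are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)  # keep the familiar "w:" prefix when serializing


def qn(tag):
    """Return the namespace-qualified name for a WordprocessingML tag."""
    return f"{{{W}}}{tag}"


def prepend_page(body, text):
    """Insert a paragraph with text plus a page break at the top of body."""
    p = ET.Element(qn("p"))
    r = ET.SubElement(p, qn("r"))
    ET.SubElement(r, qn("t")).text = text
    ET.SubElement(r, qn("br"), {qn("type"): "page"})  # page break after the text
    body.insert(0, p)
    return p


# Minimal hand-written document with a single existing paragraph.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p><w:r><w:t>Existing first paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

root = ET.fromstring(SAMPLE)
body = root.find(qn("body"))
new_p = prepend_page(body, "New first page")

texts = ["".join(t.text or "" for t in p.iter(qn("t"))) for p in body.findall(qn("p"))]
print(texts)
```

In a real .docx this XML is the word/document.xml member of the zip archive, which would need to be read, modified and written back.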