Extract presenter notes from PPTx file (powerpoint)

Extract presenter notes from PPTx file (powerpoint) - python

Is there a solution in python or php that will allow me to get the presenter notes from each slide in a power point file?
Thank you

You can use python-pptx.
pip install python-pptx
You can do the following to extract presenter notes:
import collections
import collections.abc
from pptx import Presentation
file = 'path/to/presentation.pptx'
ppt=Presentation(file)
notes = []
for page, slide in enumerate(ppt.slides):
# this is the notes that doesn't appear on the ppt slide,
# but really the 'presenter' note.
textNote = slide.notes_slide.notes_text_frame.text
notes.append((page,textNote))
print(notes)
The notes list will contain all notes on different pages.
If you want to extract text content on a slide, you need to do this:
for page, slide in enumerate(ppt.slides):
temp = []
for shape in slide.shapes:
# this will extract all text in text boxes on the slide.
if shape.has_text_frame and shape.text.strip():
temp.append(shape.text)
notes.append((page,temp))

Related

How to extract images, video and audio from a pdf file using python

I need a python program that can extract videos audio and images from a pdf. I have tried using libraries such as PyPDF2 and Pillow, but I was unable to get all three to work let alone one.

I think you could achieve this using pymupdf.
To extract images see the following: https://pymupdf.readthedocs.io/en/latest/recipes-images.html#how-to-extract-images-pdf-documents
For Sound and Video these are essentially Annotation types.
The following "annots" function would get all the annotations of a specific type for a PDF page:
https://pymupdf.readthedocs.io/en/latest/page.html#Page.annots
Annotation types are as follows:
https://pymupdf.readthedocs.io/en/latest/vars.html#annotationtypes
Once you have acquired an annotation I think you can use the get_file method to extract the content ( see: https://pymupdf.readthedocs.io/en/latest/annot.html#Annot.get_file)
Hope this helps!

#George Davis-Diver can you please let me have an example PDF with video?
Sounds and videos are embedded in their specific annotation types. Both are no FileAttachment annotation, so the respective mathods cannot be used.
For a sound annotation, you must use `annot.get_sound()`` which returns a dictionary where one of the keys is the binary sound stream.
Images on the other hand may for sure be embedded as FileAttachment annotations - but this is unusual. Normally they are displayed on the page independently. Find out a page's images like this:
import fitz
from pprint import pprint
doc=fitz.open("your.pdf")
page=doc[0] # first page - use 0-based page numbers
pprint(page.get_images())
[(1114, 0, 1200, 1200, 8, 'DeviceRGB', '', 'Im1', 'FlateDecode')]
# extract the image stored under xref 1114:
img = doc.extract_image(1114)
This is a dictionary with image metadata and the binary image stream.
Note that PDF stores transparency data of an image separately, which therefore needs some additional care - but let us postpone this until actually happening.
Extracting video from RichMedia annotations is currently possible in PyMuPDF low-level code only.
#George Davis-Diver - thanks for example file!
Here is code that extracts video content:
import sys
import pathlib
import fitz
doc = fitz.open("vid.pdf") # open PDF
page = doc[0] # load desired page (0-based)
annot = page.first_annot # access the desired annot (first one in example)
if annot.type[0] != fitz.PDF_ANNOT_RICH_MEDIA:
print(f"Annotation type is {annot.type[1]}")
print("Only support RichMedia currently")
sys.exit()
cont = doc.xref_get_key(annot.xref, "RichMediaContent/Assets/Names")
if cont[0] != "array": # should be PDF array
sys.exit("unexpected: RichMediaContent/Assets/Names is no array")
array = cont[1][1:-1] # remove array delimiters
# jump over the name / title: we will get it later
if array[0] == "(":
i = array.find(")")
else:
i = array.find(">")
xref = array[i + 1 :] # here is the xref of the actual video stream
if not xref.endswith(" 0 R"):
sys.exit("media contents array has more than one entry")
xref = int(xref[:-4]) # xref of video stream file
video_filename = doc.xref_get_key(xref, "F")[1]
video_xref = doc.xref_get_key(xref, "EF/F")[1]
video_xref = int(video_xref.split()[0])
video_stream = doc.xref_stream_raw(video_xref)
pathlib.Path(video_filename).write_bytes(video_stream)

Extracting Powerpoint background images using python-pptx

I have several powerpoints that I need to shuffle through programmatically and extract images from. The images then need to be converted into OpenCV format for later processing/analysis. I have done this successfully for images in the pptx, using:
for slide in presentation:
for shape in slide.shapes
if 'Picture' in shape.name:
pic_list.append(shape)
for extraction, and:
img = cv2.imdecode(np.frombuffer(page[i].image.blob, np.uint8), cv2.IMREAD_COLOR)
for python-pptx Picture to OpenCV conversion. However, I am having a lot of trouble extracting and manipulating the backgrounds in a similar fashion.
slide.background
is sufficient to extract a "_Background" object, but I have not found a good way to convert it into a OpenCV object similar to Pictures. Does anyone know how to do this? I am using python-pptx for extraction, but am not adverse to other packages if it's not possible with that package.

After a fair bit of work I discovered how to do this -- i.e., you don't. As far as I can tell, there is no way to directly extract the backgrounds with either python-pptx or Aspose. Powerpoint -- which, as it turns out, is an archive that can be unzipped with 7zip -- keeps its backgrounds disassembled in the ppt/media (pics), ppt/slideLayouts and ppt/slideMasters (text, formatting), and they are only put together by the Powerpoint renderer. This means that to extract the backgrounds as displayed, you basically need to run Powerpoint and take pics of the slides after removing text/pictures/etc. from the foreground.
I did not need to do this, as I just needed to extract text from the backgrounds. This can be done by checking slideLayouts and slideMasters XMLs using BeautifulSoup, at the <a:t> tag. The code to do this is pretty simple:
import zipfile
with zipfile.ZipFile(pptx_path, 'r') as zip_ref:
zip_ref.extractall(extraction_directory)
This will extract the .pptx into its component files.
from glob import glob
layouts = glob(os.path.join(extr_dir, 'ppt\slideLayouts\*.xml'))
masters = glob(os.path.join(extr_dir, 'ppt\slideMasters\*.xml'))
files = layouts + masters
This gets you the paths for slide layouts/masters.
from bs4 import BeautifulSoup
text_list = []
for file in files:
with open(file) as f:
data = f.read()
bs_data = BeautifulSoup(data, "xml")
bs_a_t = bs_data.find_all('a:t')
for a_t in bs_a_t:
text_list.append(str(a_t.contents[0]))
This will get you the actual text from the XMLs.
Hopefully this will be useful to someone else in the future.

Adding video in ppt using python cannot be played

I have to insert a video in ppt using python-pptx library. And also i have added the below code to insert it:
from pptx import Presentation
from pptx.util import Inches
prs = Presentation('template.pptx')
filepath = "file2.pptx"
layout = prs.slide_layouts[0]
slide = prs.slides.add_slide(layout9)
path = 'video.mp4'
movie=slide.shapes.add_movie(path
, Inches(4.51), Inches(1.53), Inches(6.98),
Inches(4.69),poster_frame_image=None,mime_type='video/unknown'
)
prs.save(filepath)
This code successfully creates video shape. It show a big speaker icon when i click it for preview, it doesn't play at all . I dont know what i missed here. If anyone could help me please give some suggestion for this.

Aspose.Slides for Python makes it easy to add video to a presentation slide. The following code example shows you how to do this:
import aspose.slides as slides
with slides.Presentation() as presentation:
slide = presentation.slides[0]
# Add a video frame to the slide.
video_frame = slide.shapes.add_video_frame(20, 20, 400, 300, "video.mp4")
# Add a poster image to presentation resources.
with open("poster.png", "rb") as poster_stream:
poster_image = presentation.images.add_image(poster_stream)
# Set the poster for the video.
video_frame.picture_format.picture.image = poster_image
presentation.save("example.pptx", slides.export.SaveFormat.PPTX)
This a paid product, but you can get a temporary license to evaluate all features of this library. I work as a Support Developer at Aspose.

How Could I replace Text in a PPP with Python pptx?

I want to replace the text in a textbox in Powerpoint with Python-pptx.
Everything I found online didn't work for me and the documentation isn't that helpful for me.
So I have a Textbox with the Text:
$$Name 1$$
$$Name 2$$
and I want to change the $$Name1 $$ to Tom.
How can I achieve that?

A TextFrame object defined in python-pptx helps in manipulating contents of a textbox. You can do something like:
from python-pptx import Presentation
"""open file"""
prs = Presentaion('pptfile.pptx')
"""get to the required slide"""
slide = prs.slides[0]
"""Find required text box"""
for shape in slide.shapes:
if not shape.has_text_frame:
continue
text_frame = shape.text_frame
if "Name 1" == text_frame.text:
text_frame.text = "Tom"
"""save the file"""
prs.save("pptfile.pptx")

Try this:
import pptx
input_pptx = "Input File Path"
prs = pptx.Presentation((input_pptx))
testString = "$$Name1 $$"
replaceString = 'Tom'
title_slide_layout = prs.slide_layouts[0]
slide = prs.slides.add_slide(title_slide_layout)
for slide in prs.slides:
for shape in slide.shapes:
if shape.has_text_frame:
if(shape.text.find(testString))!=-1:
shape.text = shape.text.replace(testString, replaceString)
if not shape.has_table:
continue
prs.save('C:/test.pptx')

Ok thank you. I just found out, that my PowerPoint example was totaly messed up. No everything works fine with a new PowerPoint blanked

In order to keep original formatting, we need to replace the text at the run level.
from pptx import Presentation
ppt = Presentation(file_path_of_pptx)
search_str = '$$Name1$$'
replace_str = 'Tom'
for slide in ppt.slides:
for shape in slide.shapes:
if shape.has_text_frame:
for paragraph in shape.text_frame.paragraphs:
for run in paragraph.runs:
print(run.text)
if(run.text.find(search_str))!=-1:
run.text = run.text.replace(search_str, replace_str)
ppt.save(file_path_of_new_pptx)

Since PowerPoint splits the text of a paragraph into seemingly random runs (and on top each run carries its own - possibly different - character formatting) you can not just look for the text in every run, because the text may actually be distributed over a couple of runs and in each of those you'll only find part of the text you are looking for.
Doing it at the paragraph level is possible, but you'll lose all character formatting of that paragraph, which might screw up your presentation quite a bit.
Using the text on paragraph level, doing the replacement and assigning that result to the paragraph's first run while removing the other runs from the paragraph is better, but will change the character formatting of all runs to that of the first one, again screwing around in places, where it shouldn't.
Therefore I wrote a rather comprehensive script that can be installed with
python -m pip install python-pptx-text-replacer
and that creates a command python-pptx-text-replacer that you can use to do those replacements from the command line, or you can use the class TextReplacer in that package in your own Python scripts. It is able to change text in tables, charts and wherever else some text might appear, while preserving any character formatting specified for that text.
Read the README.md at https://github.com/fschaeck/python-pptx-text-replacer for more detailed information on usage. And open an issue there if you got any problems with the code!
Also see my answer at python-pptx - How to replace keyword across multiple runs? for an example of how the script deals with character formatting...

Python Insert Image into the middle of an existing PowerPoint

I have an existing PowerPoint presentation with 20 slides. This presentation serves as an template with each slide having different backgrounds. I want to take this (existing) PowerPoint presentation, insert an image in slide number 4 (do nothing with the first 3) and save it as a new PowerPoint presentation.
This is what I have up until now. This code loads an existing presentation and saves it as a new one. Now I just need to know how to use this to insert an image to slide number 4 like described above.
Note: I am using normal Python.
from pptx import Presentation
def open_PowerPoint_Presentation(oldFileName, newFileName):
prs = Presentation(oldFileName)
#Here I guess I need to type something to complete the task.
prs.save(newFileName)
open_PowerPoint_Presentation('Template.pptx', 'NewTemplate.pptx')

I'm not really familiar with this module, but I looked at their quickstart
from pptx.util import Inches
from pptx import Presentation
def open_PowerPoint_Presentation(oldFileName, newFileName, img, left, top):
prs = Presentation(oldFileName)
slide = prs.slides[3]
pic = slide.shapes.add_picture(img, left, top)
prs.save(newFileName)
open_PowerPoint_Presentation('Template.pptx', 'NewTemplate.pptx',
'mypic.png', Inches(1), Inches(1))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract presenter notes from PPTx file (powerpoint) - python

Is there a solution in python or php that will allow me to get the presenter notes from each slide in a power point file? Thank you

Related

How to extract images, video and audio from a pdf file using python

Extracting Powerpoint background images using python-pptx

Adding video in ppt using python cannot be played

How Could I replace Text in a PPP with Python pptx?

Python Insert Image into the middle of an existing PowerPoint

Categories

Resources