Edit powerpoint .pptx template by python - python

I am trying to automate report generation in PowerPoint with Python. Is there any way to detect an existing text box in a PowerPoint template and then fill it with some text in Python?

The main task is to find the placeholders that the template provides by default, as well as any text boxes on non-template slides. The data used to fill them can come from many sources, such as a text file or web scraping; here it is taken from a Python list (list_data).
1. Suppose the presentation has n slides and we want slide 1. It can be accessed like this:
pptx.Presentation(inout_pptx).slides[0]
2. To reach the placeholders that the template provides, iterate over all shapes on the slide:
slide.shapes
3. To update a particular placeholder, set the text of its text frame:
shape.text_frame.text = data
CODE:
import pptx
# A raw string needs single backslashes
inout_pptx = r"C:\Users\lenovo\Desktop\StackOverFlow\python_pptx.pptx"
list_data = [
    'Quantam Computing dsfsf ',
    'Welcome to Quantam Computing Tutorial, hope you will get new thing',
    'User_Name sd',
    '<Enrollment Number>']
# open file
prs = pptx.Presentation(inout_pptx)
# get to the required slide
slide = prs.slides[0]
# find the required text boxes and fill them
for shape, data in zip(slide.shapes, list_data):
    if not shape.has_text_frame:
        continue
    shape.text_frame.text = data
# save the file
prs.save(inout_pptx)
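One detail of the loop above: zip pairs shapes and texts by position, so when a shape has no text frame, the list entry paired with it is skipped as well. A minimal sketch of an alternative, assuming you want the texts assigned in order to only the shapes that can hold text. The helper name fill_text_shapes is hypothetical (it is not part of python-pptx); it relies only on the has_text_frame and text_frame.text attributes already used above.

```python
def fill_text_shapes(shapes, texts):
    """Assign texts, in order, to the shapes that have a text frame.

    Hypothetical helper: `shapes` is any iterable of objects exposing
    `has_text_frame` and `text_frame.text`, as python-pptx shapes do.
    Returns the number of shapes that were filled.
    """
    remaining = iter(texts)
    filled = 0
    for shape in shapes:
        if not getattr(shape, "has_text_frame", False):
            continue  # pictures, charts, etc. do not consume a text
        try:
            text = next(remaining)
        except StopIteration:
            break  # more text shapes than texts: leave the rest untouched
        shape.text_frame.text = text
        filled += 1
    return filled
```

With the code above it could be called as fill_text_shapes(prs.slides[0].shapes, list_data) before saving the presentation.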

If I understand correctly, your presentation contains placeholders for text filling. The following code example shows you how to fill a footer on the first slide with Aspose.Slides for Python via .NET:
import aspose.slides as slides
with slides.Presentation("example.pptx") as presentation:
    firstSlide = presentation.slides[0]
    for shape in firstSlide.shapes:
        # AutoShape objects have text frames
        if (isinstance(shape, slides.AutoShape) and shape.placeholder is not None):
            if shape.placeholder.type == slides.PlaceholderType.FOOTER:
                shape.text_frame.text = "My footer text"
    presentation.save("example_out.pptx", slides.export.SaveFormat.PPTX)
I work as a Support Developer at Aspose.

Related

Delete text from pdf using PyMUPDF

I need to remove the text "DRAFT" from a pdf document using Python. I can find the text box containing the text but can't find an example of how to edit the pdf text element using pymupdf.
In the example below the draft object contains the coords and text for the DRAFT text element.
import fitz
fname = r"original.pdf"
doc = fitz.open(fname)
page = doc.load_page(0)
draft = page.search_for("DRAFT")
# insert code here to delete the DRAFT text or replace it with an empty string
out_fname = r"final.pdf"
doc.save(out_fname)
Added 4/28/2022
I found a way to delete the text, but unfortunately it also deletes any overlapping text underneath the box around DRAFT. I really just want to delete the DRAFT letters without modifying the underlying layers.
# insert code here to delete the DRAFT text or replace it with an empty string
rl = page.search_for("DRAFT", quads = True)
page.add_redact_annot(rl[0])
page.apply_redactions()
You can try this.
import fitz
doc = fitz.open("xxxx")
for page in doc:
    for xref in page.get_contents():
        stream = doc.xref_stream(xref).replace(b'The string to delete', b'')
        doc.update_stream(xref, stream)

How do I create a PDF file containing a Signature Field, using python?

In order to be able to sign a PDF document using a token based DSC, I need a so-called signature field in my PDF.
This is a rectangular field you can fill with a digital signature using e.g. Adobe Reader or Adobe Acrobat.
I want to create this signable PDF in Python.
I'm starting from plain text, or a rich-text document (Image & Text) in .docx format.
How do I generate a PDF file with this field, in Python?
Check out pyHanko. You can add, edit and digitally sign PDFs using Python.
https://github.com/MatthiasValvekens/pyHanko
It's totally free. And if you have any problems, Matthias is very helpful and responsive.
Unfortunately, I couldn't find any (free) solutions. Just Python programs that sign PDF documents.
But there is a Python PDF SDK called PDFTron that has a free trial. Here's a link to a specific article showing how to "add a certification signature field to a PDF document and sign it".
# Open an existing PDF
doc = PDFDoc(docpath)
page1 = doc.GetPage(1)
# Create a text field that we can lock using the field permissions feature.
annot1 = TextWidget.Create(doc.GetSDFDoc(), Rect(50, 550, 350, 600), "asdf_test_field")
page1.AnnotPushBack(annot1)
# Create a new signature form field in the PDFDoc. The name argument is optional;
# leaving it empty causes it to be auto-generated. However, you may need the name for later.
# Acrobat doesn't show digsigfield in side panel if it's without a widget. Using a
# Rect with 0 width and 0 height, or setting the NoPrint/Invisible flags makes it invisible.
certification_sig_field = doc.CreateDigitalSignatureField(cert_field_name)
widgetAnnot = SignatureWidget.Create(doc, Rect(0, 100, 200, 150), certification_sig_field)
page1.AnnotPushBack(widgetAnnot)
...
# Save the PDFDoc. Once the method below is called, PDFNet will also sign the document using the information provided.
doc.Save(outpath, 0)
You can use PyPDF2 (https://github.com/mstamy2/PyPDF2) to generate the PDF in Python,
and then sign it with the open-source Java-Digital-Signature, a Java command-line tool for digital signatures with a PKCS#11 token: https://github.com/AlessioScarfone/Java-Digital-Signature
You can call it from your Python code:
import subprocess
subprocess.call(['java', '-jar', 'signer.jar', 'pades', 'test.pdf'])
I use the signpdf library for Python to sign PDFs.
Read its documentation for a better understanding: https://github.com/yourcelf/signpdf
pip install signpdf
Demo:
Sign the first page of "contract.pdf" with the signature "sig.png": ->
signpdf contract.pdf sig.png --coords 1x100x100x150x40
Understand Co-ordinates: Github link

python-docx does not add picture

I'm trying to insert a picture into a Word document using python-docx but running into errors.
The code is simply:
document.add_picture("test.jpg", width = Cm(2.0))
From looking at the python-docx documentation I can see that the following XML should be generated:
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
  <pic:nvPicPr>
    <pic:cNvPr id="1" name="python-powered.png"/>
    <pic:cNvPicPr/>
  </pic:nvPicPr>
  <pic:blipFill>
    <a:blip r:embed="rId7"/>
    <a:stretch>
      <a:fillRect/>
    </a:stretch>
  </pic:blipFill>
  <pic:spPr>
    <a:xfrm>
      <a:off x="0" y="0"/>
      <a:ext cx="859536" cy="343814"/>
    </a:xfrm>
    <a:prstGeom prst="rect"/>
  </pic:spPr>
</pic:pic>
This does in fact get generated in my document.xml file (after unzipping the docx file). However, looking into the OOXML format, I can see that the image should also be saved under the media folder and the relationship should be mapped in word/_rels/document.xml.rels:
<Relationship Id="rId20"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
Target="media/image20.png"/>
None of this happens, however, and when I open the Word document I'm met with a "The picture can't be displayed" placeholder.
Can anyone help me understand what is going on?
It looks like the image is not embedded the way it should be and I need to insert it in the media folder and add the mapping for it, however as a well documented feature this should be working as expected.
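To check this without unzipping by hand, the package can be inspected with the standard library. The following is a minimal sketch, assuming the usual docx layout in which the relationships part is word/_rels/document.xml.rels and image targets are paths relative to the word/ folder. The function name image_relationships is hypothetical and not part of python-docx.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

RELS_PATH = "word/_rels/document.xml.rels"
IMAGE_TYPE = ("http://schemas.openxmlformats.org/officeDocument/"
              "2006/relationships/image")
PKG_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def image_relationships(docx_file):
    """Return (rId, target, present) for each image relationship.

    `present` is True when the target part actually exists in the zip.
    Assumes relative targets such as "media/image1.png".
    """
    with zipfile.ZipFile(docx_file) as package:
        names = set(package.namelist())
        root = ET.fromstring(package.read(RELS_PATH))
    results = []
    for rel in root.iter(PKG_NS + "Relationship"):
        if rel.get("Type") != IMAGE_TYPE:
            continue
        target = rel.get("Target")
        # Targets are resolved relative to the word/ folder
        part = posixpath.normpath(posixpath.join("word", target))
        results.append((rel.get("Id"), target, part in names))
    return results
```

An empty result, or entries whose last element is False, would match the symptom described here: the drawing XML refers to a relationship ID, but the image part or its mapping is missing from the saved file.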
UPDATE:
Testing it out with an empty docx file that image does get added as expected which leads me to believe it might have something to do with the python-docx-template library. (https://github.com/elapouya/python-docx-template)
It uses python-docx and jinja to allow templating capabilities but runs and works the same way python-docx should. I added the image to a subdoc which then gets inserted into a full document at a given place.
A sample code can be seen below (from https://github.com/elapouya/python-docx-template/blob/master/tests/subdoc.py):
from docxtpl import DocxTemplate
from docx.shared import Inches
tpl=DocxTemplate('test_files/subdoc_tpl.docx')
sd = tpl.new_subdoc()
sd.add_paragraph('A picture :')
sd.add_picture('test_files/python_logo.png', width=Inches(1.25))
context = {
'mysubdoc' : sd,
}
tpl.render(context)
tpl.save('test_files/subdoc.docx')
I'll keep this up in case anyone else manages to make the same mistake as I did :) I managed to debug it in the end.
The problem was in how I used the python-docx-template library. I opened up a DocxTemplate like so:
report_output = DocxTemplate(template_path)
DoThings(value,template_path)
report_output.render(dictionary)
report_output.save(output_path)
But I accidentally opened it up twice. Instead of passing the template to a function, when working with it, I passed a path to it and opened it again when creating subdocs and building them.
def DoThings(data,template_path):
    doc = DocxTemplate(template_path)
    temp_finding = doc.new_subdoc()
    #DO THINGS
Finally after I had the subdocs built, I rendered the first template which seemed to work fine for paragraphs and such but I'm guessing the images were added to the "second" opened template and not to the first one that I was actually rendering. After passing the template to the function it started working as expected!
I came across this problem, and it was solved by removing the width=(1.0) parameter from the add_picture call.
When width=(1.0) was passed, I could not see the picture in test.docx,
so it might have been caused by an inappropriate size being set for the picture.
To add pictures, headings, and paragraphs to an existing document:
from docx import Document
from docx.shared import Cm
doc = Document(full_path)  # open an existing document with existing styles
for row in tableData:  # list from the json api ...
    print('row {}'.format(row))
    level = row['level']
    levelStyle = 'Heading ' + str(level)
    title = row['title']
    heading = doc.add_heading(title, level)
    heading.style = doc.styles[levelStyle]
    p = doc.add_paragraph(row['description'])
    if row['img_http_path']:
        ip = doc.add_paragraph()
        r = ip.add_run()
        r.add_text(row['img_name'])
        r.add_text("\n")
        r.add_picture(row['img_http_path'], width=Cm(15.0))
doc.save(full_path)

pdfminer doesn't extract data from filled-out pdf form

I'm trying to use pdfminer to extract the filled-out contents in a pdf form. The instructions for accessing the pdf are:
Go to https://www.ffiec.gov/nicpubweb/nicweb/InstitutionProfile.aspx?parID_Rssd=1073757&parDT_END=99991231
Click "Create Report" next to the fourth report from the top (i.e.,Banking Organization Systemic Risk Report (FR Y-15))
Click "Your request for a financial report is ready"
To extract the contents in blue, I copied code from this post:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
filename = 'FRY15_1073757_20160630.PDF'
fp = open(filename, 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog['AcroForm'])['Fields']
for i in fields:
    field = resolve1(i)
    name, value = field.get('T'), field.get('V')
    print('{0}: {1}'.format(name, value))
This didn't extract the data fields as expected -- nothing was printed. I tried the same code on another pdf and it worked so I suspect the failure might have to do with the security setting of the first pdf, which is shown below
For the second pdf on which the code worked, the security setting shows "Allowed" for all the actions. I also tried using pdfminer's pdf2txt.py functionality (see here) but the filled-out data in the fields in the original pdf form (which is what I want) was not in the converted text file; only the "flat" non-fillable part of the pdf was converted. Interestingly, if I use Adobe Reader's Save As Text to convert the pdf to a text file, the fillable part was in the converted text file. This is what I've been doing to get around the failed code.
Any idea how I can extract data directly from the pdf form? Thanks.
I can only explain what the problem is but cannot present a solution because I have no working Python knowledge.
Your code iterates over the immediate children of the AcroForm Fields array and expects them to represent the form fields.
While this expectation is often fulfilled, it actually only represents a special case: form fields are arranged as a tree structure with that Fields array as the root element, e.g. in the case of your sample document there is a large tree:
Thus, you have to descend into the structure, not merely iterate over the immediate children of Fields, to find all form fields.
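The descent described above can be sketched in Python as a recursive walk over the Kids entries. This is a minimal sketch, not a tested solution for the sample document: the name iter_form_fields is hypothetical, fields are treated as dict-like objects, and a kid without its own T entry is assumed to be a widget annotation of its parent rather than a child field. Inherited values are not handled.

```python
def iter_form_fields(fields, resolve=lambda obj: obj, prefix=""):
    """Yield (qualified_name, value) for every terminal field in the tree.

    `resolve` turns an indirect reference into the object it points to;
    the default (identity) works for plain nested dicts.
    """
    for ref in fields:
        field = resolve(ref)
        name = field.get("T")
        if isinstance(name, bytes):
            name = name.decode("utf-8", errors="replace")
        # Qualified names join the partial names from the root down
        qualified = ".".join(part for part in (prefix, name) if part)
        kids = [resolve(kid) for kid in (resolve(field.get("Kids")) or [])]
        # Assumption: kids that carry no /T are widgets, not child fields
        child_fields = [kid for kid in kids if "T" in kid]
        if child_fields:
            yield from iter_form_fields(child_fields, resolve, qualified)
        else:
            yield qualified, field.get("V")
```

With the pdfminer code from the question, it would presumably be called as iter_form_fields(resolve1(doc.catalog['AcroForm'])['Fields'], resolve=resolve1), printing each name and value pair it yields.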

How can I get a list of image URLs from a Markdown file in Python?

I'm looking for something like this:
data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''
print(get_images_url_from_markdown(data))
that returns a list of image URLs from the text:
['http://somewebsite.com/image1.jpg', 'http://anotherwebsite.com/image2.jpg']
Is there anything available, or do I have to scrape Markdown myself with BeautifulSoup?
Python-Markdown has an extensive Extension API. In fact, the Table of Contents Extension does essentially what you want with headings (instead of images) plus a bunch of other stuff you don't need (like adding unique id attributes and building a nested list for the TOC).
After the document is parsed, it is contained in an ElementTree object and you can use a treeprocessor to extract the data you want before the tree is serialized to text. Just be aware that if you have included any images as raw HTML, this will fail to find those images (you would need to parse the HTML output and extract in that case).
Start off by following this tutorial, except that you will need to create a treeprocessor rather than an inline Pattern. You should end up with something like this:
import markdown
from markdown.treeprocessors import Treeprocessor
from markdown.extensions import Extension
# First create the treeprocessor
class ImgExtractor(Treeprocessor):
    def run(self, doc):
        "Find all images and append to md.images."
        self.md.images = []
        for image in doc.findall('.//img'):
            self.md.images.append(image.get('src'))
# Then tell markdown about it
class ImgExtExtension(Extension):
    def extendMarkdown(self, md):
        img_ext = ImgExtractor(md)
        # Priority 15 runs after the built-in 'inline' treeprocessor (priority 20)
        md.treeprocessors.register(img_ext, 'imgext', 15)
# Finally create an instance of the Markdown class with the new extension
md = markdown.Markdown(extensions=[ImgExtExtension()])
# Now let's test it out:
data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''
html = md.convert(data)
print(md.images)
The above outputs:
['http://somewebsite.com/image1.jpg', 'http://anotherwebsite.com/image2.jpg']
If you really want a function which returns the list, just wrap that all up in one and you're good to go.
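If pulling in an extension feels heavy, a plain function can also be sketched with a regular expression. Note that this is a different and cruder technique than the treeprocessor approach: it does not parse Markdown, it only matches inline image syntax of the form ![alt](url). Reference-style images, raw HTML img tags, and URLs containing spaces or parentheses are assumed not to occur.

```python
import re

# Matches inline images: ![alt text](url) or ![alt text](url "title")
_INLINE_IMAGE = re.compile(r'!\[[^\]]*\]\(\s*([^\s)]+)')


def get_images_url_from_markdown(text):
    """Return the URLs of inline Markdown images, in document order."""
    return _INLINE_IMAGE.findall(text)


data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''
print(get_images_url_from_markdown(data))
```

For documents that mix in reference-style images or raw HTML, a real parser is the safer choice.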
