how to edit/modify text in PDF - python

I am working on my final-year project: a website where a user can come and read PDFs. I am adding some features, such as converting currency amounts to the currency of the user's country. I am using Flask and PyMuPDF for my project, and I don't know how I can modify the text in a PDF.
Can anyone help me with this problem?
I heard here that PyMuPDF or PyPDF2 can work, but I didn't find any solution for replacing text.

Using the redaction facility of PyMuPDF is probably the adequate thing to do.
The approach:
Identify the location of the text to replace
Erase the text and replace it using redactions
Care must be taken to get hold of the original font, and to check whether the new text is longer or shorter than the original.
import fitz  # import PyMuPDF

doc = fitz.open("myfile.pdf")
number = 0  # page number, 0-based (first page assumed here)
page = doc[number]
# suppose you want to replace all occurrences of some text
disliked = "delete this"
better = "better text"
hits = page.search_for(disliked)  # list of rectangles where to replace
for rect in hits:
    # more parameters are available
    page.add_redact_annot(rect, better, fontname="helv", fontsize=11,
                          align=fitz.TEXT_ALIGN_CENTER)
page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)  # don't touch images
doc.save("replaced.pdf", garbage=3, deflate=True)
This works well with short text and medium quality expectations.
With some more effort, the original font properties, color, font size, etc. can be identified to produce a close-to-perfect result.
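For the currency-conversion use case in the question, the replacement string has to be computed before it is handed to a redaction. The following is a minimal sketch using only the standard library. The `PRICE` pattern, the `convert_prices` helper, the exchange rate, and the `EUR ` prefix are all assumptions made up for illustration; none of them are part of PyMuPDF.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical pattern: a dollar sign followed by an amount such as 1,250.50
PRICE = re.compile(r"\$(\d+(?:,\d{3})*(?:\.\d+)?)")


def convert_prices(text, rate, prefix):
    """Return (original, replacement) pairs for every price found in text."""
    pairs = []
    for match in PRICE.finditer(text):
        amount = Decimal(match.group(1).replace(",", ""))
        converted = (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
        pairs.append((match.group(0), f"{prefix}{converted}"))
    return pairs


# Assumed rate, for illustration only
pairs = convert_prices("Total: $19.99, shipping $5", Decimal("0.92"), "EUR ")
for original, replacement in pairs:
    print(original, "->", replacement)
```

Each `original` string could then be passed to `page.search_for`, and the matching `replacement` to `page.add_redact_annot`. Note that the converted text is usually longer than the original, which is exactly the case where the fit inside the redaction rectangle needs checking.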

Related

How to use Python Fitz detect Hyphen when using search_for?

I'm new to the Fitz library and am working on a project where I need to find a string in a PDF page. I'm running into a case where the text on the page that I'm searching is hyphenated. I am aware of the TEXT_DEHYPHENATE flag that I can use in the search_for function, but that doesn't work for me (as shown in the image here https://postimg.cc/zHZPdd6v ). I'm getting no hits when I search for the hyphenated string.
Python Script
import fitz  # PyMuPDF

LOC = "./test.pdf"
doc = fitz.open(LOC)
page = doc[1]
print(page.get_text())
found = page.search_for("lowcost", flags=fitz.TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
found = page.search_for("low-cost", flags=fitz.TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
found = page.search_for("low cost", flags=fitz.TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
for rect in found:
    print(rect)
Output
Abstract
The objective of “XXXXXXXXXXXXXXXXXX” was design and assemble a low-
cost and efficient tool.
DONE
0
DONE
0
DONE
0
Can someone please point me to how I might be able to detect the hyphen in my file? Thank you!
Your first approach should work, look here:
# insert some hyphenated text
page.insert_textbox((100,100,300,300),"The objective of 'xxx' was design and assemble a low-\ncost and efficient tool.")
# returns: 157.94699853658676
# now search for it again
page.search_for("lowcost") # 2 rectangles!
# returns:
# [Rect(159.3009796142578, 116.24800109863281, 175.8009796142578, 131.36199951171875),
#  Rect(100.0, 132.49501037597656, 120.17399597167969, 147.6090087890625)]
# each containing a text portion with hyphen removed
for rect in page.search_for("lowcost"):
    print(page.get_textbox(rect))
# prints:
# low
# cost
Without the original file there is no way to tell the reason for your failure.
Are you sure there really is text, and not e.g. an image or some other hiccup?
Edited: As per the comment of user @KJ below: PyMuPDF's C base library MuPDF regards all of the Unicode characters '-', 0xAD, 0x2010, 0x2011 as hyphens in this context. They all should work the same. Just reconfirmed it in an example.
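One way to narrow down such a failure is to check the extracted text directly. The sketch below joins words split by a line-end hyphen and then tests for the search string. It assumes the text comes from `page.get_text()`; the `dehyphenate` helper is a hypothetical name, not a PyMuPDF function. It only tells you whether the text is present, not where it is on the page.

```python
import re

# The hyphen characters listed above: '-', soft hyphen 0xAD, 0x2010, 0x2011,
# each directly followed by a line break
LINE_END_HYPHEN = re.compile("[-\u00ad\u2010\u2011]\n")


def dehyphenate(text):
    """Join words that were split by a hyphen at the end of a line."""
    return LINE_END_HYPHEN.sub("", text)


# Sample text standing in for the output of page.get_text()
sample = "was design and assemble a low-\ncost and efficient tool."
joined = dehyphenate(sample)
print("lowcost" in joined)
```

If the check fails even on the joined text, the page most likely contains no searchable text for that string at all, for example because it is an image.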

How do you insert an image into a specific location in an existing Word document with Python?

What I want to do is insert an image into a specific location in an existing Word document using Python. I've looked at various libraries to do this; I'm using the docx-mailmerge package to insert text and tables using Word merge fields, but unfortunately image merging is just a TODO/wishlist feature. python-docx meanwhile allows image insertion, but only at the end of a document, not in specific places.
Is there another library that does this, or a good trick to accomplish it?
Fiddling around with the underlying API (and thanks to this SO answer) I hacked my way to success:
add a placeholder in your Word document where the image should go, something like a single line that says [ChartImage1]
find the paragraph object in the document that contains that text
replace the text of that paragraph with an empty string
add a run, and inside that add your image
So something like:
from docx import Document

document = Document("template.docx")
# indexes of all paragraphs that contain the placeholder
image_paras = [i for i, p in enumerate(document.paragraphs) if "[ChartImage1]" in p.text]
p = document.paragraphs[image_paras[0]]
p.text = ""
r = p.add_run()
r.add_picture("path/to/image.png")
document.save("my_doc.docx")

Adding Image at the very beginning of an already existing docx document [duplicate]

I use python-docx to generate Microsoft Word documents. The user writes something like "Good Morning every body,This is my %(profile_img)s do you like it?"
in an HTML field. I then create a Word document, retrieve the user's picture from the database, and want to replace the keyword %(profile_img)s with that picture, NOT at the END OF THE DOCUMENT. With python-docx, a picture is added with this instruction:
document.add_picture('profile_img.png', width=Inches(1.25))
The picture is added to the document, but the problem is that it is added at the end of the document.
Is it impossible to add a picture at a specific position in a Microsoft Word document with Python? I have not found any answers to this on the net, but have seen people asking the same elsewhere with no solution.
Thanks (note: I'm not a hugely experienced programmer, and other than this awkward part the rest of my code will be very basic)
Quoting the python-docx documentation:
The Document.add_picture() method adds a specified picture to the end of the document in a paragraph of its own. However, by digging a little deeper into the API you can place text on either side of the picture in its paragraph, or both.
When we "dig a little deeper", we discover the Run.add_picture() API.
Here is an example of its use:
from docx import Document
from docx.shared import Inches
document = Document()
p = document.add_paragraph()
r = p.add_run()
r.add_text('Good Morning every body,This is my ')
r.add_picture('/tmp/foo.jpg')
r.add_text(' do you like it?')
document.save('demo.docx')
Well, I don't know if this will apply to you, but here is what I've done to place an image at a specific spot in a docx document:
I created a base docx document (template document). In this file, I've inserted some tables without borders, to be used as placeholders for images. When creating the document, first I open the template, and update the file creating the images inside the tables. So the code itself is not much different from your original code, the only difference is that I'm creating the paragraph and image inside a specific table.
from docx import Document
from docx.shared import Inches
doc = Document('addImage.docx')
tables = doc.tables
p = tables[0].rows[0].cells[0].add_paragraph()
r = p.add_run()
r.add_picture('resized.png',width=Inches(4.0), height=Inches(.7))
p = tables[1].rows[0].cells[0].add_paragraph()
r = p.add_run()
r.add_picture('teste.png',width=Inches(4.0), height=Inches(.7))
doc.save('addImage.docx')
Here's my solution. Its advantage over the first proposal is that it surrounds the picture with a title (with style Heading 1) and a section for additional comments. Note that you have to do the insertions in the reverse of the order in which they appear in the Word document.
This snippet is particularly useful if you want to programmatically insert pictures in an existing document.
from docx import Document
from docx.shared import Inches
# ------- initial code -------
document = Document()
p = document.add_paragraph()
r = p.add_run()
r.add_text('Good Morning every body,This is my ')
picPath = 'D:/Development/Python/aa.png'
r.add_picture(picPath)
r.add_text(' do you like it?')
document.save('demo.docx')
# ------- improved code -------
document = Document()
p = document.add_paragraph('Picture bullet section', 'List Bullet')
p = p.insert_paragraph_before('')
r = p.add_run()
r.add_picture(picPath)
p = p.insert_paragraph_before('My picture title', 'Heading 1')
document.save('demo_better.docx')
This adapts the answer written by Robᵩ while allowing for more flexible input from the user.
My assumption is that the HTML field mentioned by Kais Dkhili (original enquirer) is already loaded in docx.Document(). So...
Identify where is the related HTML text in the document.
import re  # regex module

img_tag = re.compile(r'%\(profile_img\)s')  # declare pattern
img_paragraph = None
for _p in document.paragraphs:
    # search() finds the placeholder anywhere in the paragraph text
    if img_tag.search(_p.text):
        img_paragraph = _p
        # first match only; for a full document search, make img_paragraph
        # a list and use its append method instead
        break  # lose the break if you want a full document search
Replace the placeholder identified by img_tag = '%(profile_img)s' with the desired image.
The following code assumes the paragraph text is contained in a single run;
adjust it accordingly if that is not the case.
from docx.shared import Inches

temp_text = img_tag.split(img_paragraph.text)
img_paragraph.runs[0].text = temp_text[0]
_r = img_paragraph.add_run()
_r.add_picture('profile_img.png', width=Inches(1.25))
img_paragraph.add_run(temp_text[1])
And done. Call document.save() once it is finalised.
In case you are wondering what to expect from the temp_text...
[In]
img_tag.split(img_paragraph.text)
[Out]
['This is my ', ' do you like it?']
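If a paragraph can contain several placeholders, the two-element split above no longer suffices. A minimal sketch of one way to generalise it, using only the standard library, follows. The `split_placeholders` helper and the `("text", ...)` / `("image", ...)` segment labels are hypothetical names chosen for illustration, and the `%(name)s` syntax is assumed from the question.

```python
import re

# Placeholder syntax assumed from the question: %(name)s
PLACEHOLDER = re.compile(r"%\((\w+)\)s")


def split_placeholders(text):
    """Split text into ("text", str) and ("image", key) segments, in order."""
    segments = []
    pos = 0
    for match in PLACEHOLDER.finditer(text):
        if match.start() > pos:
            segments.append(("text", text[pos:match.start()]))
        segments.append(("image", match.group(1)))
        pos = match.end()
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments


segments = split_placeholders("This is my %(profile_img)s do you like it?")
for kind, value in segments:
    print(kind, repr(value))
```

Each "text" segment would then become a run added with the paragraph's add_run(), and each "image" segment a run with add_picture(), looking up the image file from the key.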
I spent a few hours on this. If you need to add images to a template doc file using Python, the best solution is to use the python-docx-template library.
Documentation is available here
Examples available in here
This is variation on a theme. Letting I be the paragraph number in the specific document then:
from docx.shared import Cm

p = doc.paragraphs[I].insert_paragraph_before('\n')
p.add_run().add_picture('Fig01.png', width=Cm(15))

How to robustly extract author names from pdf papers?

I'd like to extract author names from pdf papers. Does anybody know a robust way to do so?
For example, I'd like to extract the name Archana Shukla from this pdf https://arxiv.org/pdf/1111.1648
PDF documents contain Metadata. It includes information about the document and its contents such as the author’s name, keywords, copyright information. See Adobe doc.
You can use PyPDF2 to extract PDF Metadata. See the documentation about the DocumentInformation class.
This information may not be filled in and can appear blank. So, one possibility is to parse the beginning or the end of the text and extract what you think is the author name. Of course, this is not reliable. But if you have a bibliographic database, you can try to match against it.
Nowadays, editors like Microsoft Word or Libre Office Writer always fill the author name in the Metadata. And it is copied in the PDF when you export your documents. So, this should work for you. Give it a try and tell us!
I am going to pre-suppose that you have a way to extract text from a PDF document, so the question is really "how can I figure out the author from this text". I think one straightforward solution is to use the correspondence email. Here is an example implementation:
import difflib

# Some sample text
pdf_text="""SENTIMENT ANALYSIS OF DOCUMENT BASED ON ANNOTATION\n
Archana Shukla\nDepartment of Computer Science and Engineering,
Motilal Nehru National Institute of Technology,
Allahabad\narchana@mnnit.ac.in\nABSTRACT\nI present a tool which
tells the quality of document or its usefulness based on annotations."""

def find_author(some_text):
    words = some_text.split(" ")
    # collect the space-separated chunks that contain an email address
    emails = []
    for word in words:
        if "@" in word:
            emails.append(word)
    emails_clean = emails[0].split("\n")
    actual_email = [a for a in emails_clean if "@" in a]
    actual_email = actual_email[0]
    # the part before the @ is the candidate name
    maybe_name = actual_email.split("@")[0]
    all_words_lists = [a.split("\n") for a in words]
    words = [a for sublist in all_words_lists for a in sublist]
    words.remove(actual_email)
    return difflib.get_close_matches(maybe_name, words)
In this case, find_author(pdf_text) returns ['Archana']. It's not perfect, but it's not incorrect. I think you could likely extend this in some clever ways, perhaps by getting the next word after the result or by combining this guess with metadata, or even by finding the DOI in the document if/when it exists and looking it up through some API, but nonetheless I think this should be a good starting point.
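One of the extensions mentioned above, returning more than the single matched word, can be sketched as follows: instead of the closest word, return the whole line that contains it. This is only an illustration under assumptions: the `EMAIL` pattern is a simplification, `find_author_line` is a hypothetical helper, and it presumes the author line and the email each sit on their own line of the extracted text.

```python
import difflib
import re

# Simplified email pattern, good enough for this sketch
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_author_line(text):
    """Return the line whose words best match the local part of the first email."""
    match = EMAIL.search(text)
    if match is None:
        return None
    local_part = match.group(0).split("@")[0].lower()
    best_line, best_score = None, 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or "@" in line:
            continue  # skip blank lines and the email line itself
        for word in line.split():
            score = difflib.SequenceMatcher(None, local_part, word.lower()).ratio()
            if score > best_score:
                best_line, best_score = line, score
    return best_line


sample = (
    "SENTIMENT ANALYSIS OF DOCUMENT BASED ON ANNOTATION\n"
    "Archana Shukla\n"
    "Department of Computer Science and Engineering\n"
    "archana@mnnit.ac.in\n"
    "ABSTRACT\n"
)
author = find_author_line(sample)
print(author)
```

Like the original, this is a heuristic: it fails when the email address bears no resemblance to the author's name, so combining it with the metadata remains advisable.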
First things first: there are some PDFs out there whose pages are images. I don't know whether you can easily extract the text from an image, but for the PDF you linked I think it can be done. There is a package called PyPDF2 which, as far as I know, can extract the text from a PDF. All that is left is to scan the last few pages and parse the author names.
An example of how to use the package is described here. Some of the code listed there is as follows:
import PyPDF2

pdfFileObj = open('meetingminutes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)  # number of pages in the document
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())  # text of the first page

I can't change the style of text in Word documents with Python-docx

I created a word document which contains the text
Hello. You owe me ${debt}. Please pay me back soon.
in Times New Roman size 12. The file name is debtTemplate.docx. I would like to replace {debt} with an actual number (1.20) using python-docx. I tried the following code:
from docx import Document
document = Document("debtTemplate.docx")
paragraphs = document.paragraphs
debt = "1.20"
paragraph = paragraphs[0]
text = paragraph.text
newText = text.format(debt=debt)
paragraph.clear()
paragraph.add_run(newText)
document.save("debt.docx")
This results in a new document with the desired text, but in Calibri font size 11. I would like the font to be like the original: Times New Roman size 12.
I know that you can pass a style argument to paragraph.add_run(), so I tried that, but nothing worked. E.g. paragraph.add_run(newText,style="Strong") didn't change anything.
Does anyone know what I can do?
EDIT: here's a modified version of my code that I had hoped would work but didn't.
from docx import Document
document = Document("debtTemplate.docx")
document.save("debt.docx")
paragraphs = document.paragraphs
debt = "1.20"
paragraph = paragraphs[0]
style = paragraph.style
text = paragraph.text
newText = text.format(debt=debt)
paragraph.clear()
paragraph.add_run(newText,style)
document.save("debt.docx")
This page in the docs should help you understand why the style is not having an effect. It's a pretty easy fix: http://python-docx.readthedocs.org/en/latest/user/styles.html
I like a couple other things about what you've found though:
Using the str.format() method to do placeholder replacement is a nice, easy way to do lightweight text replacement. I'll have to add that to the documentation as an approach to simple custom document generation.
In the XML for a paragraph, there is an optional element called <w:defRPr> which Word uses to indicate the default formatting for any new text added to the paragraph, for example if you started typing after placing your insertion point at the end of the paragraph. Right now, python-docx ignores that element. That's why you're getting the default Calibri 11 instead of the Times New Roman 12 you started with. But a useful feature might be to use that element, if present, to assign run properties to any new runs added at the end of the paragraph. If you want to add that as a feature request to the GitHub tracker, we'll take a look at getting it implemented.
