How to remove annotations in pdf in Python 3 - python

My original goal was to remove the extensive white margins on my PDF pages.
Then I found this purpose can be achieved by scaling the page using the code below, but annotations are not scaled.
import PyPDF2
# This works fine
with open('old.pdf', 'rb') as pdf_obj:
pdf = PyPDF2.PdfFileReader(pdf_obj)
out = PyPDF2.PdfFileWriter()
for page in pdf.pages:
page.scale(2, 2)
out.addPage(page)
with open('new.pdf', 'wb') as f:
out.write(f)
# This attempts to remove annotations
with open('old.pdf', 'rb') as pdf_obj:
pdf = PyPDF2.PdfFileReader(pdf_obj)
page = pdf.pages[2]
print(page['/Annots'], '\n\n\n\n')
page.Annots = []
print(page['/Annots'])
Is there a way to remove annotations? Or any suggestion that can help me to get rid of the white margin.

The method PdfFileWriter.removeLinks() removes links and annotations. So, if you are okay with losing both you can add out.removeLinks() in your first block of code, the one that's working fine.

Related

pypdf gives output with incorrect PDF format

I am using the following code to resize pages in a PDF:
from pypdf import PdfReader, PdfWriter, Transformation, PageObject, PaperSize
from pypdf.generic import RectangleObject
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
A4_w = PaperSize.A4.width
A4_h = PaperSize.A4.height
# resize page to fit *inside* A4
h = float(page.mediabox.height)
w = float(page.mediabox.width)
scale_factor = min(A4_h/h, A4_w/w)
transform = Transformation().scale(scale_factor,scale_factor).translate(0, A4_h/2 - h*scale_factor/2)
page.add_transformation(transform)
page.cropbox = RectangleObject((0, 0, A4_w, A4_h))
# merge the pages to fit inside A4
# prepare A4 blank page
page_A4 = PageObject.create_blank_page(width = A4_w, height = A4_h)
page.mediabox = page_A4.mediabox
page_A4.merge_page(page)
writer.add_page(page_A4)
writer.write('output.pdf')
Source: https://stackoverflow.com/a/75274841/11501160
While this code works fine for the resizing part, I have found that most input files work fine but some input files do not work fine.
I am providing download links to input.pdf and output.pdf files for testing and review. The output file is completely different from the input file. The images are missing, the background colour is different, even the pure text on first page has only the first line visible.
What is interesting is that these difference are only seen when I open the output pdf in Adobe Acrobat, or look at the physically printed pages.
The PDF looks perfect when i open in Preview (on MacOS) or open the PDF in my Chrome Browser.
and
The origin of the input pdf is that I created it in Preview (on MacOS) by mixing pages from different PDFs and dragging image files into the thumbnails as per these instructions:
https://support.apple.com/en-ca/HT202945
I've never had a problem before while making PDFs like this and even Adobe Acrobat reads the input pdf properly. Only the output pdf is problematic in Acrobat and in printers.
Is this a bug with pypdf or am I doing something wrong ?
How can i get the output PDF to be proper in Adobe Acrobat and printers etc ?
This is a valid bug with pypdf and the fix is due to be released in the next version.
Refer:
https://github.com/py-pdf/pypdf/issues/1607
The following is what PyMuPDF has to offer here. The output displays correctly in all PDF readers:
import fitz # import PyMuPDF
src = fitz.open("input.pdf")
doc = fitz.open()
for i in range(len(src)):
page = doc.new_page() # this is A4 portrait by default
page.show_pdf_page(page.rect, src, i) # scaling will happen automatically
doc.save("fitz-output.pdf",garbage=3,deflate=True)
The above method show_pdf_page() supports many more options, like selecting sub-rectangles form the source page, rotating it by arbitrary angles, and of course freely select the target page's sub-rectangle to receive the content.

Cropping the Mediabox does not work for some pdfs

I wrote a little script which shall blank out the lower half of a PDF document. The document itself shall remain the same size, but the lower half shall be just white.
(This is to remove the "instructions" part from parcel labels of German parcel comanies like DHL and Hermes.)
To do this, I take the PDF page, adjust the Mediabox, and then merge this page onto a new, blank page.
Fortunately, this works as intended with the PDFs I need it for. However, I also tried a few other PDFs and for some, it just does not work. It copies over the complete PDF. This happens for example, when my code is given this file: https://www.veeam.com/veeam_backup_product_overview_ds.pdf
Here is the code:
import pypdf # PyPDF2, 3 and 4 are deprecated. PyPDF is currently in active development
reader = pypdf.PdfReader(source_filename)
writer = pypdf.PdfWriter()
# get first page
page = reader.pages[0]
# create new page
new_page = pypdf.PageObject.create_blank_page( None, width = page.mediabox.width, height = page.mediabox.height )
# crop original
page.mediabox.bottom = ( page.mediabox.top - page.mediabox.bottom ) / 2 + page.mediabox.bottom
# merge original into empty new page
new_page.merge_page( page )
writer.add_page(new_page)
with open(output_file, "wb") as fp:
writer.write(fp)
Can anyone explain why it does not work sometimes?

Extracting Powerpoint background images using python-pptx

I have several powerpoints that I need to shuffle through programmatically and extract images from. The images then need to be converted into OpenCV format for later processing/analysis. I have done this successfully for images in the pptx, using:
for slide in presentation:
for shape in slide.shapes
if 'Picture' in shape.name:
pic_list.append(shape)
for extraction, and:
img = cv2.imdecode(np.frombuffer(page[i].image.blob, np.uint8), cv2.IMREAD_COLOR)
for python-pptx Picture to OpenCV conversion. However, I am having a lot of trouble extracting and manipulating the backgrounds in a similar fashion.
slide.background
is sufficient to extract a "_Background" object, but I have not found a good way to convert it into a OpenCV object similar to Pictures. Does anyone know how to do this? I am using python-pptx for extraction, but am not adverse to other packages if it's not possible with that package.
After a fair bit of work I discovered how to do this -- i.e., you don't. As far as I can tell, there is no way to directly extract the backgrounds with either python-pptx or Aspose. Powerpoint -- which, as it turns out, is an archive that can be unzipped with 7zip -- keeps its backgrounds disassembled in the ppt/media (pics), ppt/slideLayouts and ppt/slideMasters (text, formatting), and they are only put together by the Powerpoint renderer. This means that to extract the backgrounds as displayed, you basically need to run Powerpoint and take pics of the slides after removing text/pictures/etc. from the foreground.
I did not need to do this, as I just needed to extract text from the backgrounds. This can be done by checking slideLayouts and slideMasters XMLs using BeautifulSoup, at the <a:t> tag. The code to do this is pretty simple:
import zipfile
with zipfile.ZipFile(pptx_path, 'r') as zip_ref:
zip_ref.extractall(extraction_directory)
This will extract the .pptx into its component files.
from glob import glob
layouts = glob(os.path.join(extr_dir, 'ppt\slideLayouts\*.xml'))
masters = glob(os.path.join(extr_dir, 'ppt\slideMasters\*.xml'))
files = layouts + masters
This gets you the paths for slide layouts/masters.
from bs4 import BeautifulSoup
text_list = []
for file in files:
with open(file) as f:
data = f.read()
bs_data = BeautifulSoup(data, "xml")
bs_a_t = bs_data.find_all('a:t')
for a_t in bs_a_t:
text_list.append(str(a_t.contents[0]))
This will get you the actual text from the XMLs.
Hopefully this will be useful to someone else in the future.

Adding extra space to pages in PDF

I have a couple of PDFs I want to add a few inches on one side to give myself more room for handwritten comments in a notes app. Basically, I want to give myself more room to scribble on the sides of the pages (lecture scripts).
The pages should not be scaled, I simply want the contents to stay at the same spot from the upper left corner, but add more space at the right and maybe at the bottom.
Is there a good way to to this either using one of the Python PDF libs or using a command line tool?
Can I simply add extra space to the Media box, or do I need to do something else?
OK, the following code seems to work.
Had to set mediaBox and cropBox to get the desired result.
from PyPDF2 import PdfFileReader, PdfFileWriter
pdf = PdfFileReader("org.pdf")
writer = PdfFileWriter()
factor = 1.3
for page in pdf.pages:
x,y = page.mediaBox.lowerRight
page.mediaBox.lowerRight = ( (factor * float(x)), float(y))
x,y = page.cropBox.lowerRight
page.cropBox.lowerRight = ( (factor * float(x)), float(y))
writer.addPage( page )
with open("out.pdf", "wb") as out_f:
writer.write(out_f)

Using Python to scrape text from PDFs that encode their text to images

My code is below. I've tried it on other PDFs and it was able to extract the text accurately.
pdfFileObj = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
Specifically when I run the above code there is no output. The provider of the PDF tries to sell the data in the PDF, so it makes sense why they don't want it to be easily scraped. Just wondering what the best workaround is because I don't have 100k lying around.
If it helps it looks like the PDF was produced with pdfsharp.net. When I upload my PDF in Google Colab and assign it to a variable, a portion of the result of printing that variable is below.
{'test.pdf': b'%PDF-1.4\n%\xd3\xf4\xcc\xe1\n1 0
obj\n<<\n/CreationDate(D:20190310110705-04\'00\')\n/Title(Efficiency Summary
Player Name)\n/Creator(PDFsharp 1.32.2608-w \\(www.pdfsharp.net\\))\n/Producer(PDFsharp 1.32.2608-w \\(www.pdfsharp.net\\))\n>>\nendobj\n2 0 obj\n<<\n/Type/Catalog\n/Pages 3 0 R\n>>\nendobj\n3 0 obj\n<<\n/Type/Pages\n/Count 1\n/Kids[4 0 R]\n>>\nendobj\n4 0 obj\n<<\n/Type/Page\n/MediaBox[0 0 612 792]\n/Parent 3 0 R\n/Contents 5 0 R\n/Resources\n<<\n/ProcSet [/PDF/Text/ImageB/ImageC/ImageI]\n/XObject\n<<\n/I0 8 0 R\n>>\n>>\n/Group\n<<\n/CS/DeviceRGB\n/S/Transparency\n/I false\n/K false\n>>\n>>\nendobj\n5 0 obj\n<<\n/Length 62\n/Filter/FlateDecode\n>>\nstream\nx\x9c+\xe42T0\x00B]\x10eni\xa4\x90\x9c\x0bd\x1b\x18(\x84Tq\x15r\x15*\x98\x9a\x1aA\xe4\xcd\xcd\xcc\x14\x8c\x8d\x14\xcc\xcd\xcd#J\xf4=\r\x14\\\xf2\x15\x02\xb9#\x10\x00\xd8\xf3\r\xe0\nendstream\nendobj\n6 0 obj\n<<\n/Type/XObject\n/Subtype/Image\n/Length 159\n/Filter/FlateDecode\n/Width 900\n/Height 1250\n/BitsPerComponent 1\n/ImageMask true\n>>\nstream\nx\x9c\xed\xc11\x01\x00\x00\x00\xc2 \xfb\xa76\xc6\x1e`\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00#\xe8\'\xe0\x00\x01\nendstream\nendobj\n7 0 obj\n<<\n/Type/XObject\n/Subtype/Image\n/Length 6413\n/Filter/FlateDecode\n/Width 900\n/Height 1250\n/BitsPerComponent 8\n/ColorSpace/DeviceGray\n>>\nstream\nx\x9c\xed\xdd\x81z\xa2\xbc\x16\x05\xd0\xf7\x7f\xe9\xe4\xde\xbf\x85\xe4\x9c$X\xdb\xb1\x15t\xado\xa6U\x0c!\x02\xdb#\xb4R\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
This code might be useful to you, I used it for a previous project where I scraped data from a pdf. I'm not sure if you've tried using pytesseract. You can modify the for page in pages loop to extract specific pages. This code will turn the PDF into images, then use OCR processing and return a text file with the text found.
from pdf2image import convert_from_path
from PIL import Image
import pytesseract
import os
def OCR(pdf):
pdfName = pdf.split('.pdf')[0]
pages = convert_from_path(pdf, 500)
image_counter = 1
for page in pages:
filename = "page_"+str(image_counter)+".jpg"
page.save(pdfName+filename, 'JPEG')
image_counter = image_counter + 1
filelimit = image_counter-1
f= open(pdfName+".txt","wb")
text = ''
for i in range(1, filelimit + 1):
filename = pdfName+"page_"+str(i)+".jpg"
text += str(((pytesseract.image_to_string(Image.open(filename)))))
text = text.replace('-\n', '')
text = text.replace('\n',' \n')
os.remove(pdfName+"page_"+str(i)+".jpg")
f.write(text.encode('utf-8','replace'))
f.close()
return text
you're just seeing the raw bytes of a PDF file there, the fact they've put the "Info dict" at the top of the file, and hence seeing strings like \Creator, isn't guaranteed and just because it's a "linearised" file
doing something like Daniel suggested is the way to go, but his implementation might introduce additional artifacts. tesseract is OCR software and attempts to turn rasterized text back into characters. it might be better working directly with images in the PDF file, rather than rasterizing the whole page to an image. also encoding to a JPEG seems awkward, using a lossless format like PNG is probably going to do slightly better
generally I'd recommend using something like pytesseract, but something else, e.g. see here for getting at the images directly

Categories

Resources