Is there a way to add a watermark containing some images (icons, data matrix codes, preferably in a vector format) and text to a PDF in such a way that the original appearance of the PDF can be restored whenever needed?
In other words, I want to implement the following:
- add a watermark to existing PDF
- remove this watermark whenever desired
The watermark is provided by me in whichever format it might be needed to achieve my goal.
I have found implementations written in Python, like this solution using PyPDF2, but I have not found a way to remove the added watermark afterwards. I have also found a solution to add and remove watermarks using iText, which unfortunately is not a Python library.
Related
https://babel.hathitrust.org/cgi/pt?id=mdp.35112102781921&view=1up&seq=11
I want to use a program or a command (the latter preferred) to remove images that appear at the same position on every page (e.g., "Digitized by Google") from PDF files like the one above.
Could you show me a convenient way to do so?
I am currently using python to remove watermarks in PDF files. For example, I have a file like this:
The green shape on the center of the page is the watermark. I think it's not stored in the PDF in text form, because I can't find that text by simply searching using Edge browser (which can read PDF files).
I also cannot find the watermark among the images. I extracted all images from the PDF using PyMuPDF, and the watermark (which should appear on each page) is not among them.
The code I used for extracting is like this:
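import fitz  # PyMuPDF

document = fitz.open(self.input)
for each_page in document:
    # getImageList() is the older name of get_images() in current PyMuPDF
    image_list = each_page.getImageList()
    for image_info in image_list:
        xref = image_info[0]  # first item of each entry is the image's xref
        pix = fitz.Pixmap(document, xref)
        png = pix.tobytes()  # image data as PNG bytes (the default format)
        if png == watermark_image:
            document._deleteObject(xref)
document.save(out_filename)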
So how do I find and remove the watermark using python's libraries? How is the watermark stored inside a PDF?
Are there any other "magic" libraries that can do this task, other than PyMuPDF?
For anyone interested in details see the solution provided here.
Removal of the type of watermark used in this file works with PyMuPDF's low-level code interface. There is no direct, specialized high-level API for doing this.
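As a rough illustration of what "low-level" can mean here: a watermark that is neither searchable text nor an extractable image is often drawn with vector operators directly in the page's content stream, and tools such as Acrobat typically wrap it in a marked-content block tagged /Artifact with subtype /Watermark. The sketch below assumes exactly that layout, which is not confirmed for the file in the question, and it assumes the block contains no nested marked content. The helper names strip_watermark_blocks and remove_watermarks are made up for this example.

```python
import re

# Matches "/Artifact <<... /Watermark ...>> BDC ... EMC" in a decoded content stream.
# Assumption: the watermark block has no nested BDC/EMC pairs inside it.
WATERMARK_BLOCK = re.compile(
    rb"/Artifact\s*<<[^>]*?/Watermark[^>]*?>>\s*BDC.*?\bEMC\b\s*",
    re.DOTALL,
)


def strip_watermark_blocks(stream: bytes) -> tuple[bytes, int]:
    """Return the stream without watermark blocks, plus the number removed."""
    return WATERMARK_BLOCK.subn(b"", stream)


def remove_watermarks(src: str, dst: str) -> int:
    """Rewrite each page's content stream with PyMuPDF; returns blocks removed."""
    import fitz  # PyMuPDF, imported lazily so the helper above works without it

    removed = 0
    doc = fitz.open(src)
    for page in doc:
        page.clean_contents()          # merge the page's content streams into one
        xrefs = page.get_contents()    # xrefs of the page's content streams
        if not xrefs:
            continue                   # page without any content
        cleaned, count = strip_watermark_blocks(doc.xref_stream(xrefs[0]))
        if count:
            doc.update_stream(xrefs[0], cleaned)
            removed += count
    doc.save(dst)
    return removed
```

If remove_watermarks reports zero removed blocks, the watermark is probably stored differently (for example as an untagged Form XObject or an annotation), and inspecting the output of doc.xref_stream() for one page is a reasonable next step.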
This is a very straightforward issue. I added an invisible text layer using page.insert_text().
After saving the modified pdf, I can use page.get_text() to retrieve the created text layer.
I would like to be able to eliminate that layer, but couldn't find a function to do it.
The solution I've come up with is to render the pages as images and create a new PDF from them, but that seems very inefficient.
I would like to solve this without using any library other than fitz, and it feels like there should be a solution within fitz, considering that page.get_text() can access the exact information I'm trying to eliminate.
If you are certain of the whereabouts of your text on the page (and I understood that you are), simply use PDF redactions:
page.add_redact_annot(rect1) # remove text inside this rectangle
page.add_redact_annot(rect2)
...
page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
# the above removes everything intersecting any of the rects,
# but leaves images untouched
Obviously you can remove all text on the page by taking page.rect as the redaction rectangle.
Is there a way to count the number of images (JPEG, PNG, JPG) in a PDF document using Python?
Using pdfimages from poppler-utils
You might want to take a look at pdfimages from the poppler-utils package.
I have taken the sample pdf from - Sample PDF
On running the following command, images present in the pdf are extracted -
pdfimages /home/tata/Desktop/4555c-5055cBrochure.pdf image
So, you can use python's subprocess module to execute this command, and then extract all the images.
Note: There are some drawbacks to this method. By default it writes the images in PPM/PBM format, not JPG (the -j option keeps JPEG-encoded images as .jpg files). It may also extract some additional objects that are not actually visible images in the PDF.
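If only a count is needed, a minimal sketch along these lines avoids writing any image files: pdfimages has a -list option that prints one table row per image instead of extracting it. The function names are made up for this example, and the parser assumes the usual column layout in which the ninth column (enc) holds the encoding; that layout may differ between poppler versions.

```python
import subprocess
from collections import Counter


def count_from_listing(listing: str) -> Counter:
    """Count the rows of `pdfimages -list` output, grouped by the enc column."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a page number; the header and ruler lines do not.
        if len(fields) >= 9 and fields[0].isdigit():
            counts[fields[8]] += 1  # 9th column: encoding, e.g. "jpeg" or "image"
    return counts


def count_pdf_images(pdf_path: str) -> Counter:
    """Run `pdfimages -list` on a file and count the listed images."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdfimages fails
    )
    return count_from_listing(result.stdout)
```

Calling sum(count_pdf_images(path).values()) gives the total. Rows of type smask or stencil are counted as well, so the number can be higher than the number of pictures a reader would see on the pages.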
Using pdfminer
If you want to do this using pdfminer, take a look at this blog post -
Extracting Text & Images from PDF Files
Pdfminer allows you to traverse the layout of a particular PDF page. The Pdfminer docs illustrate the layout objects as well as the tree structure generated by pdfminer:
[Figure: Layout Objects and Tree Structure. Image source: Pdfminer Docs]
Thus, extracting LTFigure objects can help you extract / count images in the pdf document.
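A minimal sketch of that traversal, assuming the pdfminer.six fork and its extract_pages helper (the linked post may use an older API). In pdfminer's layout tree the images themselves are LTImage nodes, usually nested inside LTFigure containers, so the sketch counts those. count_images and count_pdf_images are made-up names, and the class is matched by name only so that the recursive helper does not need pdfminer imported.

```python
def count_images(layout_obj) -> int:
    """Recursively count LTImage nodes at or below a pdfminer layout object."""
    if type(layout_obj).__name__ == "LTImage":
        return 1
    # Containers such as LTPage and LTFigure are iterable; leaf objects are not.
    try:
        children = iter(layout_obj)
    except TypeError:
        return 0
    return sum(count_images(child) for child in children)


def count_pdf_images(pdf_path: str) -> int:
    """Count LTImage nodes over all pages of a PDF file."""
    from pdfminer.high_level import extract_pages  # pdfminer.six, imported lazily

    return sum(count_images(page) for page in extract_pages(pdf_path))
```

Counting LTFigure objects instead would only require changing the class name in the comparison, but a figure can hold vector drawings or several images, so the two counts need not agree.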
Note: Neither of these methods is guaranteed to be accurate; their accuracy depends heavily on the type of PDF document you are dealing with.
I don't think this can be done directly. However, I have done something similar using the following approach:
- Use Ghostscript to convert the PDF to page images.
- On each page image, use computer vision (OpenCV) to extract the areas of interest (in your case, images).
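The two steps above can be sketched roughly as follows. This is not the answerer's actual code: the Ghostscript flags are the standard ones for rendering to PNG, while the region detection is a deliberately crude heuristic (threshold, external contours, minimum area) that assumes OpenCV 4.x and near-white page backgrounds. All function names and default values are made up for this example.

```python
import subprocess


def ghostscript_command(pdf_path, out_pattern="page-%03d.png", dpi=150):
    """Build the Ghostscript call that renders each page to a PNG file."""
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m",               # 24-bit colour PNG output
        f"-r{dpi}",                      # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",   # %03d is replaced by the page number
        pdf_path,
    ]


def render_pages(pdf_path, out_pattern="page-%03d.png", dpi=150):
    """Step 1: convert the PDF to one image per page."""
    subprocess.run(ghostscript_command(pdf_path, out_pattern, dpi), check=True)


def find_regions(png_path, min_area=10_000):
    """Step 2: return (x, y, w, h) boxes of large non-white regions."""
    import cv2  # opencv-python, imported lazily; assumed to be OpenCV 4.x

    gray = cv2.imread(png_path, cv2.IMREAD_GRAYSCALE)
    # Treat everything darker than near-white paper as foreground.
    _, mask = cv2.threshold(gray, 240, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    return [(x, y, w, h) for (x, y, w, h) in boxes if w * h >= min_area]
```

The min_area threshold is what separates pictures from individual glyphs here, so it has to be tuned to the rendering resolution and to the documents at hand.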
I'm trying to remove a large number of very small images from a series of PDF documents using the awesome-looking PDFTron library for Python. Basically, I want to create a new PDF by going over each element in an existing PDF file and copying the ones that meet a certain size criterion to the new PDF in the same position.
Can someone guide me to PDFTron documentation specifically for Python to help me accomplish this? Or provide a sample script that checks for image size? I think I can do the rest (emphasis on think). The documentation available on the PDFTron website is not specifically for Python, hard to look up what I need...
You can see from the ElementEdit sample how to remove all images from a document:
http://www.pdftron.com/pdfnet/samplecode.html#ElementEdit
"Or provide a sample script that checks for image size?"
Could you clarify what you mean by "image size"? If you mean the image's dimensions as displayed in the PDF page, you can check that using Element.GetBBox. If you mean the dimensions of the original image, you could check that using Element.GetImageWidth and Element.GetImageHeight (see http://www.pdftron.com/pdfnet/samplecode.html#ImageExtract). Also, Image.GetImageDataSize gives you the size of the image data in bytes.