I am using Python to remove watermarks from PDF files. For example, I have a file like this:
The green shape in the center of the page is the watermark. I don't think it is stored in the PDF as text, because searching for that text in the Edge browser (which can read PDF files) finds nothing.
I also cannot find the watermark as an image. I extracted all images from the PDF using PyMuPDF, and the watermark (which should appear on every page) is not among them.
The code I used for the extraction looks like this:
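import fitz  # PyMuPDF

document = fitz.open(self.input)
for each_page in document:
    # getImageList() is the older camelCase name; newer PyMuPDF releases call it get_images()
    image_list = each_page.getImageList()
    for image_info in image_list:
        # image_info[0] is the xref number of the image object
        pix = fitz.Pixmap(document, image_info[0])
        png = pix.tobytes()  # image data in PNG format
        if png == watermark_image:
            document._deleteObject(image_info[0])
document.save(out_filename)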
So how do I find and remove the watermark using Python libraries? How is the watermark stored inside the PDF?
Are there any other "magic" libraries that can do this task, other than PyMuPDF?
For anyone interested in the details, see the solution provided here.
Removal of the type of watermark used in this file works with PyMuPDF's low-level code interface. There is no direct, specialized high-level API for doing this.
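To give a rough idea of what "low-level" can mean here, the sketch below assumes (this is not confirmed for the file in the question) that the watermark is drawn by operators in the page's content stream and wrapped in a marked-content block tagged /Artifact with /Subtype /Watermark, which is one common way such watermarks are stored. The helper names strip_watermark_blocks and remove_watermarks are made up for illustration, and the PyMuPDF calls use the snake_case names of newer releases.

```python
import re

# Matches a marked-content block of the form
#   /Artifact << ... /Watermark ... >> BDC ... EMC
# Assumptions: the watermark is tagged this way, the property dictionary
# contains no '>' before its closing '>>', and the block is not nested.
WATERMARK_BLOCK = re.compile(
    rb"/Artifact\s*<<[^>]*/Watermark[^>]*>>\s*BDC.*?EMC",
    re.DOTALL,
)


def strip_watermark_blocks(stream: bytes) -> bytes:
    """Return the content stream with watermark artifact blocks removed."""
    return WATERMARK_BLOCK.sub(b"", stream)


def remove_watermarks(in_path: str, out_path: str) -> None:
    """Hypothetical driver: apply the filter to every page's content stream."""
    import fitz  # PyMuPDF, imported lazily so the helper above works without it

    doc = fitz.open(in_path)
    for page in doc:
        page.clean_contents()  # normalise the page's content stream(s)
        for xref in page.get_contents():
            stream = doc.xref_stream(xref)  # decompressed stream bytes
            doc.update_stream(xref, strip_watermark_blocks(stream))
    doc.save(out_path)
```

If the watermark is instead stored as a Form XObject, an annotation, or plain untagged vector drawing, this filter will find nothing, and the content stream has to be inspected by hand to see how the shape is actually drawn.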
Related
I tried adding an image to a PDF at a specific position, inside a particular field.
I also referred to another post in the community (How to add image to PDF file in Python?).
That code overlays the image over the entire PDF page, but we want to insert the image at a particular position in the PDF.
Thanks in advance for any leads.
Is there a way to add a watermark containing some images (icons, data matrix codes, preferred in a vector format) and text to a PDF in a way that the original appearance of the PDF can be restored, whenever needed?
In other words, I want to implement the following:
- add a watermark to existing PDF
- remove this watermark whenever desired
The watermark is provided by me in whichever format it might be needed to achieve my goal.
I have found implementations written in Python, like this solution using PyPDF2, but I have not found a way to remove the added watermark afterwards. I have also found a solution for adding and removing watermarks using iText, which unfortunately is not a Python library.
Is there a way to count the number of images (JPEG, PNG, JPG) in a PDF document using Python?
Using pdfimages from poppler-utils
You might want to take a look at pdfimages from the poppler-utils package.
I have taken the sample pdf from - Sample PDF
On running the following command, images present in the pdf are extracted -
pdfimages /home/tata/Desktop/4555c-5055cBrochure.pdf image
Some of the images extracted from this brochure are -
Extracted Image1
Extracted Image 2
So, you can use python's subprocess module to execute this command, and then extract all the images.
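A minimal sketch of the subprocess approach, aimed at counting rather than extracting. It assumes pdfimages is on the PATH and relies on its -list option, which prints a table with two header lines followed by one row per image object; the function names are invented for this example.

```python
import subprocess


def count_listed_images(listing: str, images_only: bool = False) -> int:
    """Count the rows in the table printed by `pdfimages -list`.

    Assumption: the first two non-blank lines are the column header and a
    dashed separator, and the third column of each row is the object type
    ("image", "smask", ...).
    """
    lines = [line for line in listing.splitlines() if line.strip()]
    rows = lines[2:]
    if images_only:
        rows = [r for r in rows if len(r.split()) > 2 and r.split()[2] == "image"]
    return len(rows)


def count_images(pdf_path: str, images_only: bool = False) -> int:
    """Run pdfimages on the file and count what it lists."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdfimages fails
    )
    return count_listed_images(result.stdout, images_only)
```

Calling count_images("4555c-5055cBrochure.pdf") would then give the number of listed image objects; with images_only=True, rows such as soft masks are left out of the count.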
Note: There are some drawbacks to this method. By default it writes images in PPM format, not JPEG (the -j option writes JPEG-encoded images as .jpg files instead). It may also extract some additional images that do not actually appear as pictures in the PDF.
Using pdfminer
If you want to do this using pdfminer, take a look at this blog post -
Extracting Text & Images from PDF Files
Pdfminer allows you to traverse through the layout of a particular pdf page. The following image shows the layout objects as well as the tree structure generated by pdfminer -
Layout Objects and Tree Structure
Image Source - Pdfminer Docs
Thus, extracting LTFigure objects can help you extract / count images in the pdf document.
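The layout traversal can be sketched roughly as below. This assumes the pdfminer.six fork, whose high_level module provides extract_pages; the helper names are made up for illustration. It counts LTImage objects, which typically sit inside LTFigure containers, and passing LTFigure as the type instead would count figures.

```python
def count_layout_objects(elements, wanted_type) -> int:
    """Recursively count layout objects of the given type.

    Containers (pages, figures, text boxes) are iterable and hold child
    objects; leaf objects such as images are not iterable.
    """
    count = 0
    for element in elements:
        if isinstance(element, wanted_type):
            count += 1
        if hasattr(element, "__iter__"):
            count += count_layout_objects(element, wanted_type)
    return count


def count_pdf_images(pdf_path: str) -> int:
    """Hypothetical wrapper: count LTImage objects across all pages."""
    # Imported lazily so the generic walker above has no dependencies.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTImage

    return count_layout_objects(extract_pages(pdf_path), LTImage)
```

Vector drawings and inline images may not show up as LTImage at all, so the number is best treated as an estimate.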
Note: Neither of these methods is guaranteed to be accurate; how well they work depends heavily on the type of PDF document you are dealing with.
I don't think this can be done directly. However, I have done something similar using the following approach:
- Use Ghostscript to convert the PDF to page images.
- On each page image, use computer vision (OpenCV) to extract the areas of interest (in your case, images).
I'm trying to put a JPG image into a PDF using reportlab in Python, as follows:
from reportlab.pdfgen import canvas

p = canvas.Canvas(buffer)
p.drawImage(filename_jpg_image, x, y)
The problem is that the image displayed in the PDF does not have the same quality as the original. Is there a way to specify the quality in this context, or to improve it some other way? Can anybody help?
Unfortunately, most tools that put JPEGs into PDFs will uncompress and then (badly) recompress the JPEG.
img2pdf can wrap many (most?) JPEG images into PDFs without changing the compression (without decompressing at all, in fact).
Then you can use pdfrw to pull that PDF onto the reportlab canvas as a form xObject (similar to an image). There are a few examples in the pdfrw/examples/rl1 directory that show how to do this.
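A hedged sketch of how the two steps might be combined, loosely following the pattern of the rl1 examples. The function names, the temporary PDF path and the box parameters are assumptions made for this illustration, and the page size is read from the XObject's BBox on the assumption that it starts at the origin.

```python
def fit_scale(src_w: float, src_h: float, box_w: float, box_h: float) -> float:
    """Largest uniform scale factor that fits the source inside the box."""
    return min(box_w / src_w, box_h / src_h)


def place_jpeg(canv, jpg_path, tmp_pdf_path, x, y, box_w, box_h):
    """Draw a JPEG on a reportlab canvas without recompressing it.

    canv is a reportlab Canvas; tmp_pdf_path is a scratch file (hypothetical
    parameter) that receives the one-page PDF produced by img2pdf.
    """
    # Third-party imports are kept local so fit_scale has no dependencies.
    import img2pdf
    from pdfrw import PdfReader
    from pdfrw.buildxobj import pagexobj
    from pdfrw.toreportlab import makerl

    # Step 1: wrap the JPEG in a PDF, leaving the compressed data untouched.
    with open(tmp_pdf_path, "wb") as f:
        f.write(img2pdf.convert(jpg_path))

    # Step 2: turn that page into a Form XObject and draw it on the canvas.
    xobj = pagexobj(PdfReader(tmp_pdf_path).pages[0])
    scale = fit_scale(float(xobj.BBox[2]), float(xobj.BBox[3]), box_w, box_h)
    canv.saveState()
    canv.translate(x, y)
    canv.scale(scale, scale)
    canv.doForm(makerl(canv, xobj))
    canv.restoreState()
```

Because the image is scaled by the canvas transform rather than resampled, the JPEG data embedded in the output should be the same bytes as in the original file.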
Disclaimer: I am the pdfrw author.
I'm trying to remove a large number of very small images from a series of PDF documents using the awesome-looking PDFTron library for Python. Basically, I want to create a new PDF by going over each element in an existing PDF file and copying the ones that meet a certain size criterion to the new PDF in the same position.
Can someone point me to PDFTron documentation specifically for Python to help me accomplish this? Or provide a sample script that checks for image size? I think I can do the rest (emphasis on think). The documentation available on the PDFTron website is not specific to Python, so it is hard to look up what I need.
You can see from the ElementEdit sample how to remove all images from a document:
http://www.pdftron.com/pdfnet/samplecode.html#ElementEdit
Or provide a sample script that checks for image size?
Could you clarify what you mean by "image size"? If you mean the image's dimensions as displayed in the PDF page, you can check that using Element.GetBBox. If you mean the dimensions of the original image, you could check that using Element.GetImageWidth and Element.GetImageHeight (see http://www.pdftron.com/pdfnet/samplecode.html#ImageExtract). Also, Image.GetImageDataSize gives you the size of the image data in bytes.