How to remove images in a pdf file? - python

https://babel.hathitrust.org/cgi/pt?id=mdp.35112102781921&view=1up&seq=11
I want to use a program or a command (the latter preferred) to remove images that appear at the same position on every page (e.g., the "Digitized by Google" stamp) in PDF files like the one above.
Could you show me a convenient way to do so?

Related

How to crop empty space from SVG?

How do you crop all empty space from an SVG file, either from the command line or Python?
I have several SVG files that are formatted to the standard A2 letter document size but are mostly empty, and I need to bulk crop them so that their view box matches the minimum bounding box of their contents.
I can do this in Inkscape using the "Resize Page to Selection" option, but I don't see any way to access this function from the command-line. I thought it might be a call like:
inkscape -z --verb=FitCanvasToDrawing --verb=FileSave --verb=FileClose file.svg
as suggested here but that has no effect.

Add and Remove Watermark to PDF using Python

Is there a way to add a watermark containing some images (icons, data matrix codes, preferred in a vector format) and text to a PDF in a way that the original appearance of the PDF can be restored, whenever needed?
In other words, I want to implement the following:
- add a watermark to existing PDF
- remove this watermark whenever desired
The watermark is provided by me in whichever format it might be needed to achieve my goal.
I have found implementations written in Python, like this solution using PyPDF2, but I have not found a way to remove the added watermark afterwards. I have also found a solution to add and remove watermarks using iText, which unfortunately is not a Python library.

Count Images in a pdf document through python

Is there a way to count the number of images (JPEG, PNG, JPG) in a PDF document through Python?
Using pdfimages from poppler-utils
You might want to take a look at pdfimages from the poppler-utils package.
I took the sample PDF from: Sample PDF
Running the following command extracts the images present in the PDF:
pdfimages /home/tata/Desktop/4555c-5055cBrochure.pdf image
So, you can use Python's subprocess module to execute this command and then extract all the images.
Note: There are some drawbacks to this method. By default it writes the images in PPM format, not JPG. Also, some additional images might be extracted that are not actually pictures in the PDF.
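If only a count is needed, one option is to ask pdfimages for a listing instead of extracting files. The sketch below assumes poppler-utils is installed and that its -list option prints one row per embedded image after a two-line header. The function names are made up for illustration, and the count can include mask images that pdfimages lists as separate rows.

```python
import subprocess


def count_listed_images(listing):
    """Count the data rows in the text printed by `pdfimages -list`."""
    count = 0
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a page number; the column header and the
        # dashed rule under it do not.
        if fields and fields[0].isdigit():
            count += 1
    return count


def count_pdf_images(pdf_path):
    """Run `pdfimages -list` on pdf_path and count the rows it reports."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdfimages fails
    )
    return count_listed_images(result.stdout)
```

Calling count_pdf_images with the path of the brochure would then return the number of rows, without writing any image files to disk.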
Using pdfminer
If you want to do this using pdfminer, take a look at this blog post -
Extracting Text & Images from PDF Files
Pdfminer allows you to traverse the layout of a particular PDF page as a tree of layout objects.
[Figure: Layout Objects and Tree Structure. Image source: Pdfminer Docs]
Thus, extracting LTFigure objects can help you extract / count images in the pdf document.
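A minimal sketch of that traversal, assuming the pdfminer.six package is installed. In pdfminer's layout tree, LTFigure containers hold LTImage objects, so this version walks every page and counts the LTImage leaves. The helper names count_layout_objects and count_pdf_images are illustrative, not part of pdfminer.

```python
from collections.abc import Iterable


def count_layout_objects(node, is_match):
    """Recursively count objects in a layout tree for which is_match is true."""
    total = 1 if is_match(node) else 0
    # Containers such as LTPage and LTFigure are iterable; leaves are not.
    if isinstance(node, Iterable) and not isinstance(node, (str, bytes)):
        for child in node:
            total += count_layout_objects(child, is_match)
    return total


def count_pdf_images(pdf_path):
    """Count LTImage objects on all pages of the PDF at pdf_path."""
    # Imported here so the helper above works without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTImage

    return sum(
        count_layout_objects(page, lambda obj: isinstance(obj, LTImage))
        for page in extract_pages(pdf_path)
    )
```

To count the figures themselves rather than the images inside them, the same helper can be called with a predicate that tests for LTFigure.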
Note: Neither of these methods is guaranteed to be accurate; their accuracy depends heavily on the type of PDF document you are dealing with.
I don't think this can be done directly, although I have done something similar using the following approach:
- Use Ghostscript to convert the PDF to page images.
- On each page, use computer vision (OpenCV) to extract the areas of interest (in your case, images).
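The Ghostscript step could look roughly like this when driven from Python. It is only a sketch: it assumes the gs executable is on the PATH, and the output pattern, the 150 dpi resolution and the function names are arbitrary choices for illustration. The OpenCV step is left out because it depends entirely on what the pages look like.

```python
import subprocess


def ghostscript_command(pdf_path, output_pattern="page-%03d.png", dpi=150):
    """Build a Ghostscript command that renders each page to a PNG file."""
    return [
        "gs",
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit when the input has been processed
        "-dSAFER",            # restrict file access from within the PDF
        "-sDEVICE=png16m",    # 24-bit colour PNG output
        f"-r{dpi}",           # rendering resolution in dots per inch
        f"-sOutputFile={output_pattern}",  # %03d becomes the page number
        pdf_path,
    ]


def render_pages(pdf_path, **options):
    """Run Ghostscript; raises CalledProcessError if it fails."""
    subprocess.run(ghostscript_command(pdf_path, **options), check=True)
```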

mapnik and local tiff tiles

I have a local directory full of geotiff files which make up a map of the UK.
I'm using mapnik to render different images at various locations in the UK.
I'm wondering what is the best way to approach this?
I can create a single RasterSymbolizer, then loop through the tiff directory and add each tiff as a separate layer, then use Mapnik's zoom_to_box to render at the correct location.
But would this cause the rendering time to be unnecessarily slow? I have no information on how the tiles fit together (other than the data in each individual tiff of course).
I imagine there may be a way to set up some kind of vector file defining the tiff layout, so that I can quickly query it to find out which tiles I need to render for a given bounding box?
You can either generate one big tiff file from the original tiffs with gdal_merge.py (you can find it in the python-gdal package on Debian or Ubuntu) or create a virtual file that mixes them all with gdal_merge-vrt. The second option saves space but is probably slower.
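As a rough sketch of the virtual-file option: the example below uses the gdalbuildvrt utility that ships with GDAL, which is an assumption on my part and not the tool named in the answer above. The function names, the *.tif pattern and the output file name are likewise only illustrative.

```python
import subprocess
from pathlib import Path


def build_vrt_command(tiff_paths, vrt_path):
    """Build a gdalbuildvrt command that mosaics the given tiffs into one VRT."""
    tiffs = sorted(str(p) for p in tiff_paths)
    if not tiffs:
        raise ValueError("no tiff files given")
    # gdalbuildvrt takes the output file first, then the input rasters.
    return ["gdalbuildvrt", str(vrt_path), *tiffs]


def build_vrt_for_directory(tiff_dir, vrt_path="mosaic.vrt"):
    """Create a VRT covering every *.tif file found in tiff_dir."""
    command = build_vrt_command(Path(tiff_dir).glob("*.tif"), vrt_path)
    subprocess.run(command, check=True)
```

The resulting VRT is a small XML file that references the original tiffs, so it could then be used as a single raster source in place of one layer per tile.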

Python: Import multiple images from a folder and scale/combine them into one image?

I have a script that saves between 8 and 12 images to a local folder. These images are always GIFs. I am looking for a Python script to combine all the images in that one specific folder into one image. The combined 8-12 images would have to be scaled down, but I do not want to compromise the original quality (resolution) of the images either (i.e., when zoomed in on the combined image, they would look as they did initially).
The only way I am able to do this currently is by copying each image to power point.
Is this possible with python (or any other language, but preferably python)?
As an input to the script, I would type in the path where only the images are stored (i.e., C:\Documents and Settings\user\My Documents\My Pictures\BearImages)
EDIT: I downloaded ImageMagick and have been using it with the python api and from the command line. This simple command worked great for what I wanted: montage "*.gif" -tile x4 -geometry +1+1 -background none combine.gif
If you want to be able to zoom into the images, you do not want to scale them. You'll have to rely on the image viewer to do the scaling as they're being displayed - that's what PowerPoint is doing for you now.
The input images are GIF so they all contain a palette to describe which colors are in the image. If your images don't all have identical palettes, you'll need to convert them to 24-bit color before you combine them. This means that the output can't be another GIF; good options would be PNG or JPG depending on whether you can tolerate a bit of loss in the image quality.
You can use PIL to read the images, combine them, and write the result. You'll need to create a new image that is the size of the final result, and copy each of the smaller images into different parts of it.
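A minimal sketch of that approach, assuming the Pillow package (the maintained PIL fork) is installed. The grid_layout and combine_images names, the four-column default and the white background are illustrative choices. Each GIF is converted to RGB before pasting, as suggested above, so the output should be saved as PNG or JPG.

```python
import math
from pathlib import Path


def grid_layout(sizes, columns):
    """Return (canvas_size, offsets) for placing images of these sizes in a grid.

    Every cell is as large as the largest image, so nothing is scaled.
    """
    cell_w = max(w for w, _ in sizes)
    cell_h = max(h for _, h in sizes)
    rows = math.ceil(len(sizes) / columns)
    offsets = [
        ((i % columns) * cell_w, (i // columns) * cell_h)
        for i in range(len(sizes))
    ]
    return (columns * cell_w, rows * cell_h), offsets


def combine_images(folder, output_path, columns=4):
    """Paste every GIF in folder into one large image at full resolution."""
    from PIL import Image  # imported here so grid_layout works without Pillow

    paths = sorted(Path(folder).glob("*.gif"))
    # Convert to 24-bit colour so differing GIF palettes do not clash.
    images = [Image.open(p).convert("RGB") for p in paths]
    canvas_size, offsets = grid_layout([im.size for im in images], columns)
    canvas = Image.new("RGB", canvas_size, "white")
    for image, offset in zip(images, offsets):
        canvas.paste(image, offset)
    canvas.save(output_path)  # format is chosen from the file extension
```

For example, combine_images(r"C:\path\to\BearImages", "combined.png") would write a single PNG in which each original image keeps its pixel dimensions.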
You may want to outsource the image manipulation part to ImageMagick. It has a montage command that gets you 90% of the way there; just pass it some options and the names of the files in the directory.
Have a look at Python Imaging Library.
The handbook contains several examples of opening files, combining them, and saving the result.
The easiest thing to do is turn the images into numpy matrices, and then construct a new, much bigger numpy matrix to house all of them. Then convert the np matrix back into an image. Of course it'll be enormous, so you may want to downsample.
