Replacing images in a PDF with their image names using PyMuPDF - Python

Using PyMuPDF, I want to extract all images from a PDF, save them separately, and replace each image in the PDF with just its image name at the same place, then save the result as another document. I can save all the images with the following code.
import fitz

# this creates the Document object doc
doc = fitz.open("Article_Example_1_2.pdf")
for i in range(len(doc)):
    print(doc[i]._getContents())
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:  # this is GRAY or RGB (equivalent: pix.n < 5)
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
doc.save(filename=r"new.pdf")
doc.close()
but I am not sure how to replace them all in the PDF with their stored image names. I would greatly appreciate it if anyone could help me out here.

Message from the repo maintainer:
I am not sure whether we have discussed this in the repo's issues. What you can do is use the new "redaction annotation" feature. Basic approach:
Calculate the bbox of each image via Page.getImageBbox().
Add a redaction annotation via Page.addRedactAnnot(bbox, text=filename, ...).
When finished with the page, execute Page.apply_redactions(). This removes the images together with the redaction annotations; the chosen filename will appear in the former image bbox.
Save as a new document.
Make sure to use PyMuPDF v1.17.0 or later.
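A minimal sketch of that recipe, using the camelCase names of the v1.17-era API shown above (the filename scheme matches the extraction code; the bbox check is an assumption, intended to skip images the page does not actually display):
import fitz  # PyMuPDF v1.17.0 or later

doc = fitz.open("Article_Example_1_2.pdf")
for i in range(len(doc)):
    page = doc[i]
    # full=True so each list item can be passed to getImageBbox()
    for img in page.getImageList(full=True):
        xref = img[0]
        filename = "p%s-%s.png" % (i, xref)  # same name the image was saved under
        bbox = page.getImageBbox(img)
        if bbox.isEmpty or bbox.isInfinite:  # image not actually shown on this page
            continue
        page.addRedactAnnot(bbox, text=filename)
    page.apply_redactions()  # removes the images; each filename appears in its former bbox
doc.save("new.pdf")
doc.close()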

Related

How to remove boxes around SHX text without AutoCAD?

I am trying to use OCR (optical character recognition) on a lot of documents of the same type, using the pdf2image library for Python. But when it encounters PDFs with AutoCAD SHX text, it captures the bounding boxes around the text as well. At first they are not visible in the PDF; you need to click on the text to see the boxes, but they appear in the JPG result after conversion.
Here's a crop of part of the PDF document, the conversion output with the boxes, and the expected result without them (example images omitted).
Here's my function for pdf2image conversion:
import numpy as np
from PIL import ImageEnhance
from pdf2image import convert_from_bytes

def get_image_for_blueprint(path, dpi):
    """Create an image for the full blueprint.
    path: path to the pdf file
    """
    images = convert_from_bytes(open(path, 'rb').read(), dpi=dpi)  # actual conversion
    for i in images:
        width, height = i.size  # tuple: (width, height)
        if width < height:  # skip portrait pages
            continue
        print(i.size)
        image = i.resize((4096, int(height / width * 4096)))
        image = ImageEnhance.Sharpness(image).enhance(2)  # sharpness
        image = ImageEnhance.Color(image).enhance(0)      # black and white
        image = ImageEnhance.Contrast(image).enhance(2)   # contrast
        image = np.asarray(image)                         # array
        image = image.astype(np.uint8)
        return image
I found solutions that are applied in AutoCAD before saving the document, but I cannot find a way to get rid of these boxes when I only have the PDF (in Python or C++).
Maybe it is possible to resolve this using another programming language's library or additional software.

How can I avoid extracting small image elements from a PDF file in Python?

I am trying to extract all the images from this PDF file: https://s3.us-west-2.amazonaws.com/secure.notion-static.com/566ca0ca-393d-47d4-b3fc-eb3632777bf8/example.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAT73L2G45O3KS52Y5%2F20210610%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20210610T041944Z&X-Amz-Expires=86400&X-Amz-Signature=2f8a2d08647e4953448f890adb56d11b1d01e21b941ca3dc9f9b5ab3caa7f018&X-Amz-SignedHeaders=host&response-content-disposition=filename%20%3D%22example.pdf%22
using fitz (the PyMuPDF module).
The following code extracts all the images, small icons included. I want to avoid extracting those icons and get the real images only.
import fitz

pdf = fitz.open("example.pdf")
for pic in range(len(pdf)):
    image_list = pdf.getPageImageList(pic)
    j = 1
    for image in image_list:
        xref = image[0]
        pix = fitz.Pixmap(pdf, xref)
        if pix.n < 5:  # GRAY or RGB
            pix.writePNG(f'{pic}_{j}.png')
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG(f'{xref}_{pic}.png')
            pix1 = None
        pix = None
        j = j + 1
    print(f'Total images on page {pic} are {len(image_list)}')
get_page_images() returns a list of all images (directly or indirectly) referenced by the page.
>>> doc = fitz.open("pymupdf.pdf")
>>> imglist = doc.getPageImageList(0)
>>> for img in imglist: print(img)
(241, 0, 1043, 457, 8, 'DeviceRGB', '', 'Im1')
In the above example, doc.getPageImageList(0) returns the list of images shown on the page. Each entry looks like (xref, smask, width, height, bpc, colorspace, alt. colorspace, name).
So, in the above example, the values 1043 and 457 are the width and height of the image. You can add an if condition on these to eliminate small images and icons, as in the sketch below.
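For example, a minimal sketch that skips anything smaller than a threshold (the 200-pixel cutoff and the file name example.pdf are assumptions; tune them for your documents):
import fitz

MIN_SIDE = 200  # assumed threshold in pixels; adjust for your documents
pdf = fitz.open("example.pdf")
for pno in range(len(pdf)):
    for image in pdf.getPageImageList(pno):
        xref, width, height = image[0], image[2], image[3]
        if width < MIN_SIDE or height < MIN_SIDE:
            continue  # skip icons and other small decorations
        pix = fitz.Pixmap(pdf, xref)
        if pix.n - pix.alpha >= 4:  # CMYK: convert to RGB first
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.writePNG(f"{pno}_{xref}.png")
        pix = None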
More information at this doc link

How do I get the largest text in an image using tesseract in Python?

I am trying to extract the title of a PDF file. The metadata of the file doesn't really help, so I am thinking of converting the first page of each PDF file to an image and reading that image with Tesseract. I can assume that the largest text found on the image is the title.
I read the PDF using fitz and render the first page to an image:
import fitz
doc = fitz.open(filename)
page = doc.loadPage(0)
pix = page.getPixmap()
pix.writePNG("output.png")
Then I read the image file using OpenCV, put it into tesseract, and put bounding boxes on the words detected.
import cv2
import pytesseract

filename = 'output.png'
img = cv2.imread(filename)
h, w, _ = img.shape
boxes = pytesseract.image_to_boxes(img)  # also include any config options you use
for b in boxes.splitlines():
    b = b.split(' ')
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
cv2.imshow(filename, img)
cv2.waitKey(0)
I am not really familiar with Tesseract OCR, so here's where I am stuck: how do I get the text with the largest bounding boxes?
My PDF files are mostly scientific papers/journals, so you get the idea of what my files look like.
Thank you.
Normally Tesseract returns the OCR operation result as a nested structure as follows:
Block
Lines
Words
Chars (only in Tesseract 3; in Tesseract 4 you only have word boxes)
Using pytesseract.image_to_data you should get data with line and word indices.
My suggestion is to go through the words of each line and find the line with the largest average word height, which is most probably the title of the paper.
Please refer to this answer to see how to get word boxes.
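A minimal sketch of that idea (the file name reuses output.png from the question; treat the averaging heuristic as an assumption, not a guarantee):
import cv2
import pytesseract
from pytesseract import Output

img = cv2.imread("output.png")
data = pytesseract.image_to_data(img, output_type=Output.DICT)

# group word heights by (block, paragraph, line)
lines = {}
for k in range(len(data["text"])):
    if not data["text"][k].strip():
        continue  # skip empty detections
    key = (data["block_num"][k], data["par_num"][k], data["line_num"][k])
    entry = lines.setdefault(key, {"words": [], "heights": []})
    entry["words"].append(data["text"][k])
    entry["heights"].append(data["height"][k])

# the line with the largest average word height is most probably the title
best = max(lines.values(), key=lambda e: sum(e["heights"]) / len(e["heights"]))
print(" ".join(best["words"]))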

Why does the imwrite function write my images in BGR order?

I am trying to read images in a folder and save them after doing some processing.
I am using the following code to save my images:
import cv2

i = 0
while i < len(bright_images):
    cv2.imwrite(f'/path/image_{i}.png', cv2.cvtColor(bright_images[i], cv2.COLOR_BGR2RGB))
    i += 1
But the problem is that when the images are written, all the red colors turn blue; the colors change completely, as if the images are saved in BGR order instead of RGB.
How do I fix this issue?
FYI, I am reading images using this code:
import glob
import cv2
from natsort import natsorted

def load_images(path):
    image_list = []
    images = glob.glob(path)
    images = natsorted(images)
    for index in range(len(images)):
        image = cv2.cvtColor(cv2.imread(images[index]), cv2.COLOR_BGR2RGB)
        image_list.append(cv2.resize(image, (1920, 1080)))
    return image_list
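For context: OpenCV's imread and imwrite both work with BGR-ordered arrays, so an array that has been converted to RGB (as in load_images above) has to be converted back before saving. A minimal round-trip sketch (input.png is a placeholder name):
import cv2

# cv2.imread returns a BGR array; convert to RGB for processing or display
rgb = cv2.cvtColor(cv2.imread("input.png"), cv2.COLOR_BGR2RGB)

# cv2.imwrite expects BGR, so convert back before saving
cv2.imwrite("output.png", cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR))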

Sticking two pictures together in Python

Short question: I have 2 images. One is imported through:
Image = mpimg.imread('image.jpg')
The other one is a processed version of the image imported above; it is first converted from RGB to HLS and then back. The outcome of this conversion is a "list", which is different from the uint8 array of the imported image.
When I'm trying to stick these images together with:
new_img2[:height, width:width*2] = image2
I don't see the second image in the combined image, while plotting it through:
imgplot = plt.imshow(image2)
plt.show()
works fine. What is the best way to convert the original to a "list" and then combine them, or the "list" to uint8?
For some more information: the desired outcome is the two images placed side by side, but the right half stays black because the image I try to paste there has a different array type. The left image was uint8 while the other is a "list". The second image is the one saved from Python. (Example images omitted.)
I'm not sure how to do it the way you have shown above, but I have always been able to merge and save images as shown below!
import os
from PIL import Image

def mergeImages(image1, image2, dir):
    '''
    Merge image1 and image2 side by side and delete the originals.
    '''
    # adding a try/except would cut down on directory errors; not needed
    # if you know you will always open correct images
    if image2 is None:  # nothing to merge: just save the first image
        Image.open(image1).save(dir)
        os.remove(image1)
        return
    im1 = Image.open(image1)   # open image
    im1.thumbnail((640, 640))  # scales the image to fit 640x640; change to whatever you need
    im2 = Image.open(image2)   # open image
    im2.thumbnail((640, 480))  # again scale
    new_im = Image.new('RGB', (2000, 720))  # create a blank canvas; size can be changed for your needs
    new_im.paste(im1, (0, 0))    # paste image one at pos (0, 0); can be changed for you
    new_im.paste(im2, (640, 0))  # again pasting
    new_im.save(dir)             # save image in the given directory
    os.remove(image1)  # optionally delete the original images to save on space
    os.remove(image2)
After a day of searching I found out that both variables can be converted to float64. The "list" variable:
Image = np.asarray(Image)
This creates a float64 array from the "list" variable, while the uint8 image can be changed to float64 by:
Image2 = np.asarray(Image2 / 255)
Then the two can be combined with:
totalImage = np.hstack((Image, Image2))
which creates the wanted image.
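Putting those steps together, a minimal sketch with stand-in arrays (the shapes are assumptions; np.hstack only requires that both images have the same height):
import numpy as np

# stand-ins for the two inputs: a uint8 image and a nested-list image
img_uint8 = np.zeros((1080, 1920, 3), dtype=np.uint8)
img_list = [[[0.0, 0.0, 0.0]] * 1920 for _ in range(1080)]

left = np.asarray(img_uint8 / 255)      # uint8 -> float64 in [0, 1]
right = np.asarray(img_list)            # "list" -> float64 array
total_image = np.hstack((left, right))  # side by side; heights must match
print(total_image.shape)                # (1080, 3840, 3)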
