Avoid extracting repeat images from PDF

Avoid extracting repeat images from PDF - python

I'm attempting to use Optical Character Recognition on scanned image pdfs in python. This requires extracting the images within the text, and this is where I've run into issues. Some of the PDFs I'm extracting from have a ~header image on every page that gets extracted once per page. Is there a way to avoid this? I'm primarily trying to reduce the number of images I have to feed into my OCR algorithm.
Currently I do image extraction with the following two methods, though I'm ok with using a different method (though I've had multiple hours of difficulty just trying to install textract and haven't gotten it so far, so maybe not that package).
Method 1: Poppler's pdfimages tool via command line via os
def image_exporter(pdf_path, output_dir):
if not os.path.exists(output_dir):
os.makedirs(output_dir)
cmd = ['pdfimages', '-png', '-p', pdf_path,
'{}/prefix'.format(output_dir)]
subprocess.call(cmd)
print('Images extracted')
Method 2: Fitz/PyMuPDF
def img_extract(pdf_path, output_dir):
name_start = pdf_path.split('\\')[-1][:15]
doc = fitz.open(pdf_path)
for i in range(len(doc)):
for img in doc.getPageImageList(i):
xref = img[0]
pix = fitz.Pixmap(doc, xref)
name = "p%s-%s.png" % (i, xref)
name = name_start + ' ' + name
name = output_dir + '\\' + name
if pix.n < 5: # this is GRAY or RGB
#pix.writePNG(name)
pix.writeImage(name)
else: # CMYK: convert to RGB first
pix1 = fitz.Pixmap(fitz.csRGB, pix)
#pix1.writePNG(name)
pix1.writeImage(name)
pix1 = None
pix = None
Both of these are essentially copies of code I found elsewhere (Extract images from PDF without resampling, in python? being one of the sources).
I should also mention that I have very little understanding of the structure of a pdf document itself. So I'm having other odd things happening (extracting inversed color images, super blurred images, image1 of pg_x where its got text with random letters missing & image2 of same pg_x with the random missing letters but none of the letters in image1, etc). So perhaps an equally valid question is if there is a way to combine all the images on one page into a single image that I can scan with my OCR code? I'm primarily trying to avoid having to scan through huge quantities of images.

Related

Replacing a word with another word, and replacing an image with another image in a PDF file through python, is this possible?

I need to replace a K words with K other words for every PDF file I have within a certain path file location and on top of this I need to replace every logo with another logo. I have around 1000 PDF files, and so I do not want to use Adobe Acrobat and edit 1 file at a time. How can I start this?
Replacing words seems at least doable as long as there is a decent PDF reader one can access through Python ( Note I want to do this task in Python ), however replacing an image might be more difficult. I will most likely have to find the dimension of the current image and resize the image being used to replace the current image dynamically, whilst the program runs through these PDF files.
Hi, so I've written down some code regarding this:
from pikepdf import Pdf, PdfImage, Name
import os
import glob
from PIL import Image
import zlib
example = Pdf.open(r'...\Likelihood.pdf')
PagesWithImages = []
ImageCodesForPages = []
# Grab all the pages and all the images in every page.
for i in example.pages:
if len(list(i.images.keys())) >= 1:
PagesWithImages.append(i)
ImageCodesForPages.append(list(i.images.keys()))
pdfImages = []
for i,j in zip(PagesWithImages, ImageCodesForPages):
for x in j:
pdfImages.append(i.images[x])
# Replace every single page using random image, ensure that the dimensions remain the same?
for i in pdfImages:
pdfimage = PdfImage(i)
rawimage = pdfimage.obj
im = Image.open(r'...\panda.jpg')
pillowimage = pdfimage.as_pil_image()
print(pillowimage.height)
print(pillowimage.width)
im = im.resize((pillowimage.width, pillowimage.height))
im.show()
rawimage.write(zlib.compress(im.tobytes()), filter=Name("/FlateDecode"))
rawimage.ColorSpace = Name("/DeviceRGB")
So just one problem, it doesn't actually replace anything. If you're wondering why and how I wrote this code I actually got it from this documentation:
https://buildmedia.readthedocs.org/media/pdf/pikepdf/latest/pikepdf.pdf
Start at Page 53
I essentially put all the pdfImages into a list, as 1 page can have multiple images. In conjunction with this, the last for loop essentially tries to replace all these images whilst maintaining the same width and height size. Also note, the file path names I changed here and it definitely is not the issue.
Again Thank You

I have figured out what I was doing wrong. So for anyone that wants to actually replace an image with another image in place on a PDF file what you do is:
from pikepdf import Pdf, PdfImage, Name
from PIL import Image
import zlib
example = Pdf.open(filepath, allow_overwriting_input=True)
PagesWithImages = []
ImageCodesForPages = []
# Grab all the pages and all the images in every page.
for i in example.pages:
imagelists = list(i.images.keys())
if len(imagelists) >= 1:
for x in imagelists:
rawimage = i.images[x]
pdfimage = PdfImage(rawimage)
rawimage = pdfimage.obj
pillowimage = pdfimage.as_pil_image()
im = Image.open(imagePath)
im = im.resize((pillowimage.width, pillowimage.height))
rawimage.write(zlib.compress(im.tobytes()), filter=Name("/FlateDecode"))
rawimage.ColorSpace = Name("/DeviceRGB")
rawimage.Width, rawimage.Height = pillowimage.width, pillowimage.height
example.save()
Essentially, I changed the arguements in the first line, such that I specify that I can overwrite. In conjunction, I also added the last line which actually allows me to save.

How Can I Use Python to Output Mutiple PNGs as a layered Image File PSD/XCF

I have looked at mutiple answers before posting this question but am finding these difficult to understand.
The question is:
I currently have a script which combines a total of 7 pngs from 7 different folders to create a single png image.
The script chooses a png from each folder randomly to create a unique combination for the final single png image.
The script allows you to set how many images you want, and dependent on the number of images chosen that number of images are created, each with its own unique combination.
What I would like to do is create either an XCF,PSD or any other easily accessible/opensource layered image file for each combination created, each layer should be each png file chosen from each of the 7 folders.
I cannot post the full code here, but the code which is most relevant to the question is here:
Create each composite
com1 = Image.alpha_composite(im1, im2)
com2 = Image.alpha_composite(com1, im3)
com3 = Image.alpha_composite(com2, im4)
com4 = Image.alpha_composite(com3, im5)
com5 = Image.alpha_composite(com4, im6)
com6 = Image.alpha_composite(com5, im7)
# Convert to RGB
# rgb_im = com5.convert('RGB')
rgb_im = com6.convert('RGBA')
file_name = str(item["tokenId"]) + ".png"
rgb_im.save("./images/" + file_name)
dump_meta(all_images[count - 1]
Any help would be greatly appreciated and thank you for taking the time to read this.

how do i increase the resolution of my gif file?

As I am trying to create a gif file, the file has been created successfully but it is pixelating. So if anyone can help me out with how to increase resolution.
.Here is the code:-
import PIL
from PIL import Image
import NumPy as np
image_frames = []
days = np.arange(0, 12)
for i in days:
new_frame = PIL.Image.open(
r"C:\Users\Harsh Kotecha\PycharmProjects\pythonProject1\totalprecipplot" + "//" + str(i) + ".jpg"
)
image_frames.append(new_frame)
image_frames[0].save(
"precipitation.gif",
format="GIF",
append_images=image_frames[1:],
save_all="true",
duration=800,
loop=0,
quality=100,
)
Here is the Gif file:-
Here are the original images:-
image1
image2
iamge3

Updated Answer
Now that you have provided some images I had a go at disabling the dithering:
#!/usr/bin/env python3
from PIL import Image
# User editable values
method = Image.FASTOCTREE
colors = 250
# Load images precip-01.jpg through precip-12.jpg, quantize to common palette
imgs = []
for i in range(1,12):
filename = f'precip-{i:02d}.jpg'
print(f'Loading: {filename}')
try:
im = Image.open(filename)
pImage = im.quantize(colors=colors, method=method, dither=0)
imgs.append(pImage)
except:
print(f'ERROR: Unable to open {filename}')
imgs[0].save(
"precipitation.gif",
format="GIF",
append_images=imgs[1:],
save_all="true",
duration=800,
loop=0
)
Original Answer
Your original images are JPEGs which means they likely have many thousands of colours 2. When you make an animated GIF (or even a static GIF) each frame can only have 256 colours in its palette.
This can create several problems:
each frame gets a new, distinct palette stored with it, thereby increasing the size of the GIF (each palette is 0.75kB)
colours get dithered in an attempt to make the image look as close as possible to the original colours
different colours can get chosen for frames that are nearly identical which means colours flicker between distinct shades on successive frames - can cause "twinkling" like stars
If you want to learn about GIFs, you can learn 3,872 times as much as I will ever know by reading Anthony Thyssen's excellent notes here, here and here.
Your image is suffering from the first problem because it has 12 "per frame" local colour tables as well as a global colour table3. It is also suffering from the second problem - dithering.
To avoid the dithering, you probably want to do some of the following:
load all images and append them all together into a 12x1 monster image, and find the best palette for all the colours. As all your images are very similar, I think that you'll get away with generating a palette just from the first image without needing to montage all 12 - that'll be quicker
now palettize each image, with dithering disabled and using the single common palette
save your animated sequence of the palletised images, pushing in the singe common palette from the first step above
2: You can count the number of colours in an image with ImageMagick, using:
magick YOURIMAGE -format %k info:
3: You can see the colour tables in a GIF with gifsicle using:
gifsicle -I YOURIMAGE.GIF

Easily readable text not recognized by tesseract

I have used the following PyTorch implementation of EAST ( Efficient and Accurate Scene Text Detector) to identify and draw bounding boxes around text in a number of images and it works very well!
However, the next step of OCR which I am trying with pytesseract in order to extract the text form these images and converting them to strings - is failing horribly. Using all possible configurations of --oem and --psm, I am unable to get pytesseract to detect what appears to be very clear text, for example:
The recognized text is below the images. Even though I have applied contrast enhancement, and also tried dilating and eroding, I cannot get tesseract to recognize the text. This is just one example of many images where the text is even larger and clearer. Any suggestions on transformations, configs, or other libraries would be helpful!
UPDATE: After trying Gaussian blur + Otso thresholding, I am able to get black text on white background (apparently which is ideal for pytesseract), and also added Spanish language, but it still cannot read very plain text - for example:
reads as gibberish.
The processed text images are and
and the code I am using:
img_path = './images/fesa.jpg'
img = Image.open(img_path)
boxes = detect(img, model, device)
origbw = cv2.imread(img_path, 0)
for box in boxes:
box = box[:-1]
poly = [(box[0], box[1]),(box[2], box[3]),(box[4], box[5]),(box[6], box[7])]
x = []
y = []
for coord in poly:
x.append(coord[0])
y.append(coord[1])
startX = int(min(x))
startY = int(min(y))
endX = int(max(x))
endY = int(max(y))
#use pre-defined bounding boxes produced by EAST to crop the original image
cropped_image = origbw[startY:endY, startX:endX]
#contrast enhancement
clahe = cv2.createCLAHE(clipLimit=4.0, tileGridSize=(8,8))
res = clahe.apply(cropped_image)
text = pytesseract.image_to_string(res, config = "-psm 12")
plt.imshow(res)
plt.show()
print(text)

Use these updated data files.
This guide criticizes out-of-the box performance (and maybe the accuracy could be affected too):
Trained data. On the moment of writing, tesseract-ocr-eng APT package for Ubuntu 18.10 has terrible out of the box performance, likely because of corrupt training data.
According to the following test I did, using the updated data files seems to provide better results. This is the code I used:
import pytesseract
from PIL import Image
print(pytesseract.image_to_string(Image.open('farmacias.jpg'), lang='spa', config='--tessdata-dir ./tessdata --psm 7'))
I downloaded spa.traineddata (your example images have Spanish words, right?) to ./tessdata/spa.traineddata. And the result was:
ARMACIAS
And for the second image:
PECIALIZADA:
I used --psm 7 because here it says that it means "Treat the image as a single text line" and I thought that should make sense for your test images.
In this Google Colab you can see the test I did.

Tesseract could not recognize text from a pdf file

I tried to use Tesseract in Python to OCR some PDFs. The workflow is to convert a PDF to a series of images first using wand, then send them to Tesseract based on this example. I applied this to 5 PDFs but found it failed to convert one (completely failed). It works fine to convert PDF to Tiff. Thus, I guess maybe something needs to be tuned in the OCR process? Or any other tools I should use to deal with this situation? I tried xpdfbin-win-3.04 which worked on this PDF but did not work as well as Tesseract on the other PDFs...
Screenshot of failed PDF
Output text
Code
from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()2
pth_str = "C:/Users/TH/Desktop/OCR_test/"
fname_list = ["999437-Asb_1-34.pdf"]
for each_file in fname_list:
print each_file
req_image = []
final_text = []
# convert to tiff
image_pdf = Image(filename=pth_str+each_file, resolution=600)
image_tif = image_pdf.convert('tiff')
for img in image_tif.sequence:
img_page = Image(image=img)
req_image.append(img_page.make_blob('tiff'))
# begin OCR
for img in req_image:
txt = tool.image_to_string(
PI.open(io.BytesIO(img)),
lang=lang,
builder=pyocr.builders.TextBuilder()
)
final_text.append(txt.encode('ascii','ignore'))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Avoid extracting repeat images from PDF - python

Related

Replacing a word with another word, and replacing an image with another image in a PDF file through python, is this possible?

How Can I Use Python to Output Mutiple PNGs as a layered Image File PSD/XCF

how do i increase the resolution of my gif file?

Easily readable text not recognized by tesseract

Tesseract could not recognize text from a pdf file

Categories

Resources