How to remove boxes around shx text without AutoCAD? - python

I am trying to run OCR (optical character recognition) on a large number of documents of the same type, using the pdf2image library for Python to convert them to images first. When a PDF contains AutoCAD SHX text, the conversion also renders the bounding boxes around the text. The boxes are not visible in the PDF at first; you have to click on the text to see them. But they appear in the JPG output after conversion.
Here's an image of part of the PDF document:
crop from actual pdf
And here's the output of the conversion:
pdf after conversion
I expect something like this:
pdf after conversion how it should be
Here's my function for pdf2image conversion:
import numpy as np
from PIL import ImageEnhance

def get_image_for_blueprint(path, dpi):
    """Create image for full blueprint from path
    path: path to pdf file
    dpi: resolution used for rendering
    """
    from pdf2image import convert_from_bytes
    with open(path, 'rb') as f:
        images = convert_from_bytes(f.read(), dpi=dpi)  # Actual conversion function
    for i in images:
        width, height = i.size
        if width < height:
            continue  # skip portrait pages
        else:
            print(i.size)  # tuple : (width, height)
            image = i.resize((4096, int(height / width * 4096)))
            enhancer = ImageEnhance.Sharpness(image)
            image = enhancer.enhance(2)  # Sharpness
            enhancer = ImageEnhance.Color(image)
            image = enhancer.enhance(0)  # black and white
            enhancer = ImageEnhance.Contrast(image)
            image = enhancer.enhance(2)  # Contrast
            image = np.asarray(image)  # array
            image = image.astype(np.uint8)
            return image  # first landscape page only
I found solutions that have to be applied in AutoCAD before saving the document, but I cannot find a way to get rid of these boxes when I only have the PDF (in Python or C++).
Maybe it can be solved with a library for another programming language or with additional software.
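One possible direction, offered as a sketch rather than a confirmed fix: AutoCAD normally writes SHX text into the PDF as comment annotations, which would explain why the boxes only show up when the text is clicked. Assuming that is the case here, rendering the pages with annotations switched off should leave them out of the image. The example below uses PyMuPDF (imported as `fitz`) instead of pdf2image; `zoom_for_dpi` and `render_without_annotations` are hypothetical helper names, and the code assumes a PyMuPDF version whose `Page.get_pixmap` accepts `annots`.

```python
def zoom_for_dpi(dpi):
    """PDF user space has 72 units per inch, so the zoom factor is dpi / 72."""
    return dpi / 72.0


def render_without_annotations(path, dpi=300):
    """Render each page to a PIL image without drawing annotations
    (assumed here to be the SHX text boxes)."""
    # Imported lazily so the module loads even if these third-party
    # packages are missing.
    import fitz  # PyMuPDF
    from PIL import Image

    zoom = zoom_for_dpi(dpi)
    images = []
    with fitz.open(path) as doc:
        for page in doc:
            # annots=False asks PyMuPDF not to draw annotations on the page.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), annots=False)
            images.append(
                Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            )
    return images
```

If the boxes still appear in the result, they are probably part of the page content rather than annotations, and this approach will not help.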

Related

Python Pytesseract - unable to read the image

I am new to pytesseract OCR. I was able to read a simple picture and get the text from it, but I am unable to read the text from the attached PNG file. From the attached image, I am trying to extract ishSilver and (SLV) as text.
I would appreciate any help with this. Thank you in advance.
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Tesseract-OCR\tesseract.exe' # actual path to tesseract
im = Image.open('Images/' + '1680.png')
# Size of the image in pixels (size of original image)
# (This is not mandatory)
width, height = im.size
# Setting the points for cropped image
left = width*.88
top = 0
right = width
bottom = height
# Cropped image of above dimension
# (It will not change original image)
im1 = im.crop((left, top, right, bottom))
# Converting the cropped image to text
text = pytesseract.image_to_string(im1, lang="eng")
print(text)
print("\n")
# Finding "(" and ")" to get the actual stock ticker and saving it in a symbol for placeorder
ticker = text[text.find("(")+1:text.find(")")]
print(ticker)
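The slicing on the last lines misbehaves when the OCR output lacks one of the parentheses, because `str.find` then returns -1. A slightly more defensive variant, sketched here with a regular expression, returns `None` in that case. `extract_ticker` is a hypothetical helper, and the sample strings are made up for illustration rather than real OCR output.

```python
import re


def extract_ticker(text):
    """Return the first parenthesised token in text, or None if there is none."""
    match = re.search(r"\(\s*([A-Za-z0-9.]+)\s*\)", text)
    return match.group(1) if match else None


print(extract_ticker("ishSilver (SLV)"))  # SLV
print(extract_ticker("no ticker here"))   # None
```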

Python GUI to label and train SLIC images

Hello, I am currently developing a Python GUI with tkinter to train a CNN. For that I need a button to label SLIC superpixel images. Can someone help me with how to do that?
This is the part of the code I am using, but I need the SLIC images to appear in the GUI instead of in a separate window.
'''
def _load_image(path):
    """
    Loads an image from a given path using the Pillow library and marks its SLIC superpixels
    :param path: Path to image
    :return: Image with the superpixel boundaries drawn on it
    """
    n_segments = 300
    image = Image.open(path)
    segments = slic(image, n_segments=n_segments, compactness=30, sigma=1)
    fig = plt.figure("Superpixels -- %d segments" % n_segments)
    ax = fig.add_subplot(1, 1, 1)
    pimg = mark_boundaries(image, segments)
    ax.imshow(pimg)
    plt.axis("off")
    plt.show()  # this is what opens the separate window
    if resize:  # `resize` is not defined in this excerpt
        max_height = 500
        s = image.size
        ratio = max_height / s[1]
        # Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter
        image = image.resize((int(s[0] * ratio), int(s[1] * ratio)), Image.LANCZOS)
    return pimg
'''
Kindly help me out in this

Replacing Images with Image Names instead in Pdf using pymupdf

Using PyMuPDF, I want to extract all images from a PDF, save them separately, replace every image in the PDF with just its file name at the same position, and save the result as another document. I can save all images with the following code.
import fitz
# This creates the Document object doc
doc = fitz.open("Article_Example_1_2.pdf")
html_text = ""
for i in range(len(doc)):
    print(doc[i]._getContents())
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:  # this is GRAY or RGB or pix.n < 5
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
doc.save(filename=r"new.pdf")
doc.close()
However, I am not sure how to replace them all in the PDF with their stored image names. I would greatly appreciate any help.
Message from the repo maintainer:
I am not sure whether we have discussed this in the issue blog of the repo. What you can do is using the new feature "redaction annotation". Basic approach:
1. Calculate the bbox of each image via Page.getImageBbox().
2. Add a redaction annotation via Page.addRedactAnnot(bbox, text=filename, ...).
3. When finished with the page, execute Page.apply_redactions(). This will remove all images and all redactions. The chosen filename will appear in the former image bbox.
4. Save as a new document.
Make sure to use PyMuPDF v1.17.0 or later.
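The approach described above might look roughly like this. It is a sketch, not the maintainer's code: it uses the snake_case method names of recent PyMuPDF releases (`get_images`, `get_image_bbox`, `add_redact_annot`) rather than the camelCase names mentioned above, and `image_filename` and `replace_images_with_names` are hypothetical helpers. The file naming mirrors the `p<page>-<xref>.png` pattern from the question.

```python
def image_filename(page_index, xref):
    """Build the PNG name for an extracted image (same pattern as the question)."""
    return "p%s-%s.png" % (page_index, xref)


def replace_images_with_names(src, dst):
    """Save every image as PNG, then replace it in the PDF with its file name."""
    import fitz  # PyMuPDF, imported lazily; a recent version is assumed

    doc = fitz.open(src)
    for page in doc:
        # full=True returns the complete item tuples that get_image_bbox expects.
        for item in page.get_images(full=True):
            xref = item[0]
            name = image_filename(page.number, xref)
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha >= 4:  # CMYK: convert to RGB before saving as PNG
                pix = fitz.Pixmap(fitz.csRGB, pix)
            pix.save(name)
            # Mark the image area for redaction, with the file name as replacement text.
            bbox = page.get_image_bbox(item)
            page.add_redact_annot(bbox, text=name)
        # Applying the redactions removes the marked content and writes the names.
        page.apply_redactions()
    doc.save(dst)
    doc.close()
```

How `apply_redactions` treats the image object itself (blanking its pixels or removing it entirely) depends on its `images` argument and on the PyMuPDF version, so the result is worth checking on a sample file.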

how can I put multiple images and texts on a background image using Python PIL such that texts don't overlap with image?

I'm trying to put multiple images and texts on top of a background image using the Python PIL image library. The number of images and the number of texts do not always have to be the same. Texts and images have to be placed separately: each text must be drawn somewhere outside every image's bounding box.
I'm using the code below to do this:
from PIL import Image, ImageDraw, ImageFont
import os, random
with open('C:/Users/nike/Desktop/namelist.txt', "r") as word_list:
    words = list(word_list)
k=[]
for i in words:
    j = i.replace(' ','').replace('\n','')
    k.append(j)
folder=r"C:/Users/nike/Desktop/imagefolder"
a=random.choice(os.listdir(folder))
file = folder+'//'+a
random_text=random.choice(k)
img = Image.open(file)
img_w, img_h = img.size
background = Image.open('C:/Users/nike/Desktop/backgroundimages/back.jpeg','r')
bg_w, bg_h = background.size
offset = ((bg_w - img_w) // 2, (bg_h - img_h) // 2)
draw = ImageDraw.Draw(background)
background.paste(img, offset)
font = ImageFont.truetype("C:/Users/nike/Desktop/open-sans/abc.ttf", 16)
draw.text((0, 0),random_text,(255,255,255),font=font)
background.save('out.png')
The code above pastes one image at the center of the background image and draws the text at coordinate (0, 0) of the background. How can I paste multiple texts and images onto the background so that no text is drawn on top of an image? Any suggestion will be helpful.
Example:
expected result
On the background image, I need to place images and texts such that they do not overlap each other.
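One common way to get this, sketched below under stated assumptions, is rejection sampling: pick a random position for each item and keep it only if its rectangle does not intersect any rectangle already placed. The helpers `overlaps` and `place_without_overlap` are hypothetical, sizes are given as `(width, height)` tuples, and text sizes would have to be measured first (for example with `ImageDraw.textbbox` in Pillow).

```python
import random


def overlaps(a, b):
    """True if rectangles a and b, given as (left, top, right, bottom), intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def place_without_overlap(bg_size, item_sizes, max_tries=1000, rng=random):
    """Return one (left, top, right, bottom) box per item,
    or None for an item where no free spot was found."""
    bg_w, bg_h = bg_size
    placed = []
    result = []
    for w, h in item_sizes:
        box = None
        if w <= bg_w and h <= bg_h:  # items larger than the background never fit
            for _ in range(max_tries):
                x = rng.randint(0, bg_w - w)
                y = rng.randint(0, bg_h - h)
                candidate = (x, y, x + w, y + h)
                if not any(overlaps(candidate, other) for other in placed):
                    box = candidate
                    placed.append(box)
                    break
        result.append(box)
    return result


# Example: two images and one text label on an 800x600 background.
boxes = place_without_overlap((800, 600), [(200, 150), (120, 120), (90, 20)])
print(boxes)
```

The top-left corner of each returned box can then be passed to `background.paste(img, (left, top))` for images or to `draw.text((left, top), ...)` for texts.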

Get DPI information of PDF image Python

I have a PDF file with an image embedded in it. How can I get the DPI information of that particular image using Python?
I tried the "pdfimages" tool from poppler-utils; it gives me the height and width in pixels.
But how can I get the DPI of the image from that?
Like the PostScript and EPS formats, a PDF file has no resolution because it is a vector format. All you can do is retrieve dimensions in pt (here, those of the page's media box):
import io
from PyPDF2 import PdfFileReader
with io.open(path, mode="rb") as f:
    input_pdf = PdfFileReader(f)
    media_box = input_pdf.getPage(0).mediaBox
min_pt = media_box.lowerLeft
max_pt = media_box.upperRight
pdf_width = max_pt[0] - min_pt[0]
pdf_height = max_pt[1] - min_pt[1]
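Although the PDF page itself has no resolution, an effective DPI for an embedded image can be derived once both its pixel size and the size at which it is drawn on the page are known, since one PDF point is 1/72 inch. The sketch below shows only that arithmetic; `effective_dpi` is a hypothetical helper, and the numbers in the example are made up.

```python
POINTS_PER_INCH = 72.0  # PDF user space unit: 1 pt = 1/72 inch


def effective_dpi(pixels, size_pt):
    """Pixels per inch for an image of `pixels` px drawn over `size_pt` points."""
    if size_pt <= 0:
        raise ValueError("size_pt must be positive")
    return pixels * POINTS_PER_INCH / size_pt


# A 2550 px wide scan drawn across a US Letter page width (612 pt = 8.5 in):
print(effective_dpi(2550, 612))  # 300.0
```

The pixel size is what `pdfimages` already reports. The drawn size equals the media box only when the image fills the whole page, as in a typical scan; recent Poppler versions of `pdfimages -list` also print x-ppi and y-ppi columns computed from the actual placement.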
