I am trying to run OCR (Optical Character Recognition) on a lot of documents of the same type. I use the pdf2image library for Python. But when it processes PDFs that contain AutoCAD SHX text, it also captures the bounding boxes around the text. The boxes are not visible in the PDF at first; you need to click on the text to see them. But they appear in the JPG result after conversion.
Here's an image of part of the PDF document:
[image: crop from actual pdf]
And here's the output of the conversion:
[image: pdf after conversion]
I expect something like this:
[image: pdf after conversion how it should be]
Here's my function for pdf2image conversion:
import numpy as np
from PIL import ImageEnhance

def get_image_for_blueprint(path, dpi):
    """Create image for full blueprint from path
    path: path to pdf file
    """
    from pdf2image import convert_from_bytes
    images = convert_from_bytes(open(path, 'rb').read(), dpi=dpi)  # Actual conversion function
    for i in images:
        width, height = i.size
        if width < height:
            continue  # skip portrait pages
        else:
            print(i.size)  # tuple : (width, height)
            image = i.resize((4096, int(height / width * 4096)))
            enhancer = ImageEnhance.Sharpness(image)
            image = enhancer.enhance(2)  # Sharpness
            enhancer = ImageEnhance.Color(image)
            image = enhancer.enhance(0)  # black and white
            enhancer = ImageEnhance.Contrast(image)
            image = enhancer.enhance(2)  # Contrast
            image = np.asarray(image)  # array
            image = image.astype(np.uint8)
            return image  # returns the first landscape page only
I found solutions that have to be applied in AutoCAD before saving the document, but I cannot find a way to get rid of these boxes when I only have the PDF (in Python or C++).
Maybe it can be solved with a library for another programming language or with additional software.
I am trying to export an image as a .pdf file in DIN A4 size. For that I am using PIL (python imaging library).
My problem is that the resulting pdf file does not seem to have the correct size. Using 300 DPI the size should be 2480x3507 pixels or 210x297 millimeters.
Depending on the application I open the file with, I get different sizes:
Adobe Acrobat Reader: 210x297 millimeters (correct size; please note next error)
Microsoft Edge: 280x396 millimeters
While the images printed on the page look correct in MS Edge, I only get a fraction of two overlapping images in Adobe Acrobat Reader.
I would like to store the page in DIN A4 and be able to print it out. However, with those wrong sizes the images appear stretched on paper.
To get the image I convert the size of the page (21 x 29.7 cm) into pixels with a given DPI using the following function:
def cm_to_px(centimeters):
    return (300 * centimeters) / 2.54 # pixels with resolution of 300 DPI
To paste the images on an A4 paper, I use this code:
# get the page image with the correct size
page = Image.new('1', (cm_to_px(21), cm_to_px(29.7)), 1)
# paste the images on the page
for y in range(0, vertical_per_page):
    for x in range(0, horizontal_per_page):
        page.paste(image, (img_w, img_h))
# save the image as pdf file
page.save('file.pdf', resolution=300)
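One thing worth checking in the snippet above: cm_to_px returns a float, while, as far as I know, PIL's Image.new expects whole-number pixel dimensions. Below is a minimal sketch of the conversion with rounding, assuming that is at least part of the problem. The dpi parameter and the px_to_mm helper are hypothetical additions for illustration and are not in the original code.

```python
def cm_to_px(centimeters, dpi=300):
    """Convert centimeters to a whole number of pixels at the given DPI."""
    return round(centimeters * dpi / 2.54)


def px_to_mm(pixels, dpi=300):
    """Convert pixels back to millimeters (hypothetical helper for checking)."""
    return pixels * 25.4 / dpi


# DIN A4 is 21 x 29.7 cm
a4_px = (cm_to_px(21), cm_to_px(29.7))
print(a4_px)  # (2480, 3508)
print(px_to_mm(a4_px[0]), px_to_mm(a4_px[1]))
```

Rounding gives 3508 rather than 3507 for the height because 29.7 cm at 300 DPI is about 3507.87 pixels. Converting back to millimeters is a quick way to check that the page size a viewer reports matches the size that was intended.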
I have a DICOM file from which I read images. However, the images I read have an incorrect colormap. Ideally, the image should look like this:
[image]
However, the following code only gives me this:
[image]
If I take only the red component, I get one of the images below, which is not correct and cannot be adjusted to the ideal result with any colormap I tried:
[image] or [image]
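import tkinter as tk
from tkinter import filedialog
import matplotlib.pyplot as plt
import pydicom as dicom  # assumed alias; the original snippet omits its imports
root = tk.Tk()
root.withdraw()  # hide the empty root window
path = filedialog.askopenfilename()
ds = dicom.dcmread(path, force=True)  # reads a file data set
video = ds.pixel_array  # reads a sequence of RGB images
plt.imsave(some_path, video[0], format='png')  # some_path: output file path; gives image [2]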
What have I done wrong?
This really looks like YCbCr data. Is the Photometric Interpretation something like YBR_FULL? If so, then as mentioned in the documentation, you need to apply a colour space conversion, which in pydicom is:
from pydicom import dcmread
from pydicom.pixel_data_handlers import convert_color_space
ds = dcmread(...)
rgb = convert_color_space(ds.pixel_array, "YBR_FULL", "RGB")
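To show what such a conversion does to a single pixel, here is a minimal pure-Python sketch using the standard full-range YCbCr to RGB equations (the inverse of the YBR_FULL definition). The function name ybr_full_to_rgb is hypothetical, and this is not pydicom's implementation; its rounding may differ slightly.

```python
def ybr_full_to_rgb(y, cb, cr):
    """Convert one full-range YCbCr (YBR_FULL) pixel to 8-bit RGB."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)

    def clamp(value):
        # Round to the nearest integer and keep the result in 0..255
        return max(0, min(255, round(value)))

    return clamp(r), clamp(g), clamp(b)


# A neutral grey has Cb = Cr = 128, so it maps to itself
print(ybr_full_to_rgb(128, 128, 128))  # (128, 128, 128)
```

This also suggests why taking only the "red" component of unconverted data cannot be fixed with a colormap: the first channel is luma (Y), and the other two are colour-difference signals rather than green and blue.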
I am new to pytesseract OCR. I was able to read a simple picture and get the text from it, but I am unable to read the text from the attached PNG file. From the attached image, I am trying to extract ishSilver and (SLV) as text.
I would appreciate it if you could help on this. Thank you in advance.
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Tesseract-OCR\tesseract.exe' # actual path to tesseract
im = Image.open('Images/' + '1680.png')
# Size of the image in pixels (size of original image)
# (This is not mandatory)
width, height = im.size
# Setting the points for cropped image
left = width*.88
top = 0
right = width
bottom = height
# Cropped image of above dimension
# (It will not change original image)
im1 = im.crop((left, top, right, bottom))
# Capturing the image in image viewer and converting to text
text = pytesseract.image_to_string(im1, lang="eng")
print(text)
print("\n")
# Finding the "(" and ")" to the actual stock ticker and save the text in symbol for placeorder to access the file
ticker = text[text.find("(")+1:text.find(")")]
print(ticker)
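The slice based on text.find above gives misleading results when the OCR output contains no parentheses, because find returns -1 in that case. A minimal sketch of a more defensive extraction, assuming the ticker is the first parenthesised token in the recognised text (extract_ticker is a hypothetical helper name):

```python
import re


def extract_ticker(text):
    """Return the first parenthesised token in text, or None if there is none."""
    match = re.search(r"\(([^()]+)\)", text)
    return match.group(1).strip() if match else None


print(extract_ticker("ishSilver (SLV)"))  # SLV
print(extract_ticker("no ticker here"))   # None
```

Returning None makes the "nothing recognised" case explicit, so the calling code can retry with a different crop or preprocessing step instead of using a garbage string.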
I was wondering if anyone had any experience in working programmatically with .pdf files. I have a .pdf file and I need to crop every page down to a certain size.
After a quick Google search I found the pyPdf library for python but my experiments with it failed. When I changed the cropBox and trimBox attributes on a page object the results were not what I had expected and appeared to be quite random.
Has anyone had any experience with this? Code examples would be much appreciated, preferably in Python.
pyPdf does what I expect in this area. Using the following script:
#!/usr/bin/python
#
from pyPdf import PdfFileWriter, PdfFileReader
with open("in.pdf", "rb") as in_f:
    input1 = PdfFileReader(in_f)
    output = PdfFileWriter()
    numPages = input1.getNumPages()
    print "document has %s pages." % numPages
    for i in range(numPages):
        page = input1.getPage(i)
        print page.mediaBox.getUpperRight_x(), page.mediaBox.getUpperRight_y()
        page.trimBox.lowerLeft = (25, 25)
        page.trimBox.upperRight = (225, 225)
        page.cropBox.lowerLeft = (50, 50)
        page.cropBox.upperRight = (200, 200)
        output.addPage(page)
    with open("out.pdf", "wb") as out_f:
        output.write(out_f)
The resulting document has a trim box that is 200x200 points and starts at 25,25 points inside the media box.
The crop box is 25 points inside the trim box.
Here is how my sample document looks in acrobat professional after processing with the above code:
This document will appear blank when loaded in acrobat reader.
Use this to get the dimensions of the PDF page:
from PyPDF2 import PdfWriter, PdfReader, PdfMerger
reader = PdfReader("/Users/user.name/Downloads/sample.pdf")
page = reader.pages[0]
print(page.cropbox.lower_left)
print(page.cropbox.lower_right)
print(page.cropbox.upper_left)
print(page.cropbox.upper_right)
After this, take the page reference and apply the crop by setting new box coordinates:
page.mediabox.lower_right = (lower_right_new_x_coordinate, lower_right_new_y_coordinate)
page.mediabox.lower_left = (lower_left_new_x_coordinate, lower_left_new_y_coordinate)
page.mediabox.upper_right = (upper_right_new_x_coordinate, upper_right_new_y_coordinate)
page.mediabox.upper_left = (upper_left_new_x_coordinate, upper_left_new_y_coordinate)
# for example, my custom coordinates:
# page.mediabox.lower_right = (611, 500)
# page.mediabox.lower_left = (0, 500)
# page.mediabox.upper_right = (611, 700)
# page.mediabox.upper_left = (0, 700)
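PDF box coordinates are measured in points from the bottom-left corner of the page, so it can be easier to think in terms of margins to trim. A minimal sketch, assuming a 612 x 792 point (US Letter) page; crop_coordinates is a hypothetical helper and not part of PyPDF2:

```python
def crop_coordinates(width, height, left=0, bottom=0, right=0, top=0):
    """Return (lower_left, upper_right) in points after trimming the margins."""
    lower_left = (left, bottom)
    upper_right = (width - right, height - top)
    if lower_left[0] >= upper_right[0] or lower_left[1] >= upper_right[1]:
        raise ValueError("margins leave no visible area")
    return lower_left, upper_right


# Keep the horizontal band between y = 500 and y = 700 on an assumed Letter page
lower_left, upper_right = crop_coordinates(612, 792, bottom=500, top=92, right=1)
print(lower_left, upper_right)  # (0, 500) (611, 700)
```

The two tuples can then be assigned to the lower_left and upper_right attributes of the box, as in the commented example above.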
How do I know the coordinates to crop?
Thanks for all answers above.
Step 1. Run the following code to get (x1, y1).
from PyPDF2 import PdfWriter, PdfReader
reader = PdfReader("test.pdf")
page = reader.pages[0]
print(page.cropbox.upper_right)
Step 2. View the pdf file in full screen mode.
Step 3. Capture the screen as an image file screen.jpg.
Step 4. Open screen.jpg in MS Paint or GIMP. These applications show the coordinates of the cursor.
Step 5. Remember the following coordinates, (x2, y2), (x3, y3), (x4, y4) and (x5, y5), where (x4, y4) and (x5, y5) determine the rectangle you want to crop.
Step 6. Get page.cropbox.upper_left and page.cropbox.lower_right by the following formulas. Here is a tool for calculating.
page.cropbox.upper_left = (x1*(x4-x2)/(x3-x2),(1-y4/y3)*y1)
page.cropbox.lower_right = (x1*(x5-x2)/(x3-x2),(1-y5/y3)*y1)
Step 7. Run the following code to crop the pdf file.
from PyPDF2 import PdfWriter, PdfReader
reader = PdfReader('test.pdf')
writer = PdfWriter()
for page in reader.pages:
    page.cropbox.upper_left = (100,200)
    page.cropbox.lower_right = (300,400)
    writer.add_page(page)
with open('result.pdf','wb') as fp:
    writer.write(fp)
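The Step 6 formulas can be wrapped in a small function. The steps do not define (x2, y2) and (x3, y3); judging from the formulas, x2 and x3 look like the left and right edges of the page in the screenshot, and y3 like its bottom edge, with the top of the page at y = 0 in full-screen mode. The sketch below relies on that reading, and its function and parameter names are hypothetical.

```python
def screen_to_pdf(point, page_size, page_left, page_right, page_bottom):
    """Map a screenshot pixel (x, y) to PDF points using the Step 6 formulas.

    point:       (x4, y4) or (x5, y5), a cursor position in the screenshot
    page_size:   (x1, y1), the page's upper-right corner in points
    page_left:   x2, assumed left edge of the page in the screenshot
    page_right:  x3, assumed right edge of the page in the screenshot
    page_bottom: y3, assumed bottom edge of the page in the screenshot
    """
    x, y = point
    x1, y1 = page_size
    pdf_x = x1 * (x - page_left) / (page_right - page_left)
    pdf_y = (1 - y / page_bottom) * y1  # screen y grows downward, PDF y grows upward
    return pdf_x, pdf_y


# The centre of a 612 x 792 point page shown between x = 100 and x = 712
print(screen_to_pdf((406, 396), (612, 792), 100, 712, 792))  # (306.0, 396.0)
```

Calling it once for (x4, y4) and once for (x5, y5) gives the values to assign to page.cropbox.upper_left and page.cropbox.lower_right.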
You are probably looking for a free solution, but if you have money to spend, PDFlib is a fabulous library. It has never disappointed me.
You can convert the PDF to PostScript (pdftops or pdf2ps) and then use text processing on the PostScript file. After that you can convert the output back to PDF (pstopdf or ps2pdf).
This works nicely if the PDFs you want to process are all generated by the same application and are somewhat similar. If they come from different sources, it is usually too hard to process the PostScript files because the structure varies too much. But even then you might be able to fix page sizes and the like with a few regular expressions.
Acrobat Javascript API has a setPageBoxes method, but Adobe doesn't provide any Python code samples. Only C++, C# and VB.
Cropping pages of a .pdf file
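from PIL import Image

def ImageCrop():
    img = Image.open("page_1.jpg")
    # Crop box in pixels, measured from the top-left corner of the image
    left = 90
    top = 580
    right = 1600
    bottom = 2000
    img_res = img.crop((left, top, right, bottom))
    # outfile4 was undefined in the original snippet; this path is a placeholder
    outfile4 = "page_1_cropped.jpg"
    img_res.save(outfile4, 'JPEG')  # save() writes the file itself, no open() needed

ImageCrop()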