Having Problems with extracting only a Rect object in PyMuPDF

I tried the solution from the thread "Read specific region from PDF".
Sadly, the following example from that thread, by user Zach Young, doesn't work for me.
import os.path
import fitz
from fitz import Document, Page, Rect
# For visualizing the rects that PyMuPDF uses compared to what you see in the PDF
VISUALIZE = True
input_path = "test.pdf"
doc: Document = fitz.open(input_path)
for i in range(len(doc)):
    page: Page = doc[i]
    page.clean_contents()  # https://pymupdf.readthedocs.io/en/latest/faq.html#misplaced-item-insertions-on-pdf-pages
    # Hard-code the rect you need
    rect = Rect(0, 0, 100, 100)
    if VISUALIZE:
        # Draw a red box to visualize the rect's area (text)
        page.draw_rect(rect, width=1.5, color=(1, 0, 0))
    text = page.get_textbox(rect)
    print(text)
if VISUALIZE:
    head, tail = os.path.split(input_path)
    viz_name = os.path.join(head, "viz_" + tail)
    doc.save(viz_name)
However, if I set my own values for the Rect object (which seem reasonable to me) and run
print(rect.is_empty)
it outputs True.
It also doesn't draw the rectangle, and there is no output from
text = page.get_textbox(rect)
But if I just run
text = page.get_text()
I get correct output.
I wonder why the rect is reported as empty, because I really need to extract the text from only a certain area.
Thanks

Nothing is returned in the following case, because none of the characters is fully contained in the box:
Here is another page. For the box drawn on it, the output is this:
#ifnde
#defin
/*
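One plausible reason for rect.is_empty being True is that the corners were given in the wrong order. PyMuPDF page coordinates start at the top-left corner with y growing downward, so values read off in PDF's native bottom-up system can produce y0 > y1. The following is a minimal pure-Python sketch of that idea, not PyMuPDF code: the helper names normalized and is_empty and the sample values are hypothetical, and the emptiness rule (no positive width or height) is my understanding of how recent PyMuPDF versions behave.

```python
def normalized(x0, y0, x1, y1):
    """Return the corners reordered so that x0 <= x1 and y0 <= y1."""
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def is_empty(x0, y0, x1, y1):
    """Assumed emptiness rule: the rectangle has no positive width or height."""
    return x1 <= x0 or y1 <= y0


# Hypothetical values: y0 is larger than y1, as can happen when the corners
# are read off bottom-up instead of top-down.
raw = (50, 700, 300, 600)
fixed = normalized(*raw)

print("raw empty:", is_empty(*raw))
print("fixed:", fixed, "empty:", is_empty(*fixed))
```

With a real fitz.Rect, printing the rect and checking that x0 < x1 and y0 < y1 is a quick first test; Rect.normalize() reorders the corners in place if they turn out to be swapped.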

Related

How to detect colored blocks in a PDF file with python (pdfminer, minecart, tabula...)

I am trying to extract quite a few tables from a PDF file. These tables are conveniently "highlighted" with different colors, which makes them easy to spot by eye (see the example screenshot).
I think it would be good to detect the position/coordinates of those colored blocks and use the coordinates to extract the tables.
I have figured out the table extraction part (using tabula-py), so it is the first step that is blocking me. From what I gathered, minecart is the best tool for colors and shapes in PDF files, short of full-scale image processing with OpenCV. But I have had no luck detecting the coordinates of the colored boxes/blocks.
I would appreciate any help!

I think I got a solution:
import minecart
pdffile = open(fn, 'rb')
doc = minecart.Document(pdffile)
page = doc.get_page(page_num)  # page_num is 0-based
for shape in page.shapes.iter_in_bbox((0, 0, 612, 792)):
    if shape.fill:
        shape_bbox = shape.get_bbox()
        shape_color = shape.fill.color.as_rgb()
        print(shape_bbox, shape_color)
I would then need to filter the color or the shape size...
My earlier failure was due to having used a wrong page number :(
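The filtering step mentioned above could look roughly like this. It is a sketch in plain Python that works on (bbox, rgb) pairs such as the ones printed by the loop; the function names, the size thresholds, the tolerance, and the sample data are all hypothetical, and it assumes the colors are RGB tuples of floats between 0 and 1.

```python
def bbox_size(bbox):
    """Width and height of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = bbox
    return abs(x1 - x0), abs(y1 - y0)


def color_close(c1, c2, tol=0.05):
    """True if every RGB component differs by at most tol."""
    return all(abs(a - b) <= tol for a, b in zip(c1, c2))


def select_blocks(shapes, target_rgb, min_width=20, min_height=20, tol=0.05):
    """Keep the bboxes that are large enough and filled with the target color."""
    hits = []
    for bbox, rgb in shapes:
        width, height = bbox_size(bbox)
        if width >= min_width and height >= min_height and color_close(rgb, target_rgb, tol):
            hits.append(bbox)
    return hits


# Made-up sample data standing in for the (shape_bbox, shape_color) pairs.
shapes = [
    ((72, 500, 540, 700), (0.9, 0.95, 1.0)),   # large light-blue block
    ((72, 490, 540, 491), (0.0, 0.0, 0.0)),    # thin black rule
    ((100, 100, 110, 110), (0.9, 0.95, 1.0)),  # tiny light-blue marker
]
table_blocks = select_blocks(shapes, target_rgb=(0.9, 0.95, 1.0))
print(table_blocks)
```

The surviving bounding boxes can then be passed on to the table extraction step as areas of interest.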

PyMuPDF lets you extract so-called "line art": the vector drawings on a page.
The result is a list of dictionaries, one per "path" (as PDF calls interconnected drawings), from which you can select the ones of interest.
For example, the following identifies drawings that represent filled rectangles that are not too small:
import fitz  # doc is assumed to be an already opened fitz.Document
page = doc[0]  # load some page (here page 0)
paths = page.get_drawings()  # extract all vector graphics
filled_rects = []  # filled rectangles without border land here
for path in paths:
    if path["type"] != "f":  # only consider paths with a fill color
        continue
    rect = path["rect"]
    if rect.width < 20 or rect.height < 20:  # only consider sizable rects
        continue
    filled_rects.append(rect)  # hopefully an area coloring a table
# make a visible border around the hits to see success:
for rect in filled_rects:
    page.draw_rect(rect, color=fitz.pdfcolor["red"])
doc.save("debug.pdf")

Pillow - Transparency over non-transparent image with paste

Let me preface this with a disclaimer that I am clueless when it comes to imaging/graphics altogether, so maybe I'm lacking a fundamental understanding of something here.
I'm trying to paste an image (game_image) to my base image (image) with a transparent overlay (overlay_image) over top to add some darkening for the text.
Here's an example of the expected result:
Here's an example of what my current code generates:
Here is my current code:
from PIL import Image, ImageFont, ImageDraw
# base image sizing specific to Twitter recommended
base_image_size = (1600, 900)
base_image_mode = "RGBA"
base_image_background_color = (0, 52, 66)
image = Image.new(base_image_mode, base_image_size, base_image_background_color)
# game_image is the box art image on the left side of the card
game_image = Image.open("hunt.jpg")
image.paste(game_image)
# overlay_image is the darkened overlay over the left side of the card
overlay_image = Image.new(base_image_mode, base_image_size, (0, 0, 0))
overlay_image.putalpha(128)
# x position should be negative 50% of base canvas size
image.paste(overlay_image, (-800, 0), overlay_image)
image.save("test_image.png", format="PNG")
You can see that the game image sort of inherits the transparency from the overlay. I suspect it has something to do with the mask added in my paste above, but I tried looking into what masking is and it's just beyond my understanding in any context I find it in.
Any help understanding why this occurs and/or how I can resolve it is appreciated!

You are super close. All you need is to use Image.alpha_composite instead of paste, so the last two lines of your code should be:
image = Image.alpha_composite(image, overlay_image)
image.save("test_image.png", format="PNG")
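To see why the pasted version turns the game image semi-transparent, it helps to look at what happens to a single pixel. The sketch below is plain Python, not Pillow code: paste_with_mask mimics, as far as I understand it, how paste with a mask blends every band linearly (the alpha band included), while alpha_over is the standard "over" compositing formula that alpha_composite is documented to apply. Both function names and the pixel values are made up for illustration, and Pillow's own integer rounding may differ slightly.

```python
def paste_with_mask(dst, src, mask):
    """Linear blend of all four bands, weighted by the mask value (0..255)."""
    m = mask / 255
    return tuple(round(d * (1 - m) + s * m) for d, s in zip(dst, src))


def alpha_over(dst, src):
    """'Over' compositing of two straight-alpha RGBA pixels."""
    dst_a, src_a = dst[3] / 255, src[3] / 255
    out_a = src_a + dst_a * (1 - src_a)
    if out_a == 0:
        return (0, 0, 0, 0)
    rgb = tuple(
        round((s * src_a + d * dst_a * (1 - src_a)) / out_a)
        for d, s in zip(dst[:3], src[:3])
    )
    return rgb + (round(out_a * 255),)


opaque_pixel = (200, 100, 50, 255)  # a pixel of the box art, fully opaque
overlay_pixel = (0, 0, 0, 128)      # the half-transparent black overlay

pasted = paste_with_mask(opaque_pixel, overlay_pixel, mask=overlay_pixel[3])
composited = alpha_over(opaque_pixel, overlay_pixel)
print("paste with mask:", pasted)
print("alpha composite:", composited)
```

In both cases the color is darkened by about half, but only the blended version lowers the alpha of the result, which is what shows up as the image "inheriting" the overlay's transparency in the saved PNG.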

How to iterate through excel spreadsheet rows in Python?

I have written a script that draws rectangles around features of an image according to their x/y/r pixel coordinates, and this is all functioning well. The functioning code is as follows:
ss = pd.read_excel(xeno_data)
fandc = []
for index, row in ss.head().iterrows():
    filename = row['filename']
    coords = row['xyr_coords']
    # Use RegEx to find anything that looks like a group of digits, possibly separated by a decimal point.
    x, y, r = re.findall(r'[0-9.]+', coords)
    print(f'DEBUG: filename={filename}, x={x}, y={y}, r={r}')
    fandc.append({'filename': filename, 'x': x, 'y': y, 'r': r})
# Draw a transparent rectangle:
im = im.convert('RGBA')
overlay = Image.new('RGBA', im.size)
draw = ImageDraw.Draw(overlay)
# The x,y,r coordinates are centre of sponge (x,y) and radius (r).
draw.rectangle(((float(fandc[0]['x'])-float(fandc[0]['r']), float(fandc[0]['y'])-float(fandc[0]['r'])), (float(fandc[0]['x'])+float(fandc[0]['r']), float(fandc[0]['y'])+float(fandc[0]['r']))), fill=(255,0,0,55))
img = Image.alpha_composite(im, overlay)
# Remove alpha for saving in jpg format.
img = img.convert("RGB")
img.show()
This code produces the desired result: it successfully draws a faded red rectangle over a feature in the centre-bottom of the image.
However, this is tailored to the first row of the data ('fandc[0]'). How do I adjust this code to automatically iterate or loop through each row of my spreadsheet (xeno_data), i.e. 'fandc[1]', 'fandc[2]', 'fandc[3]', etc.?
Thanks all!

I don't have access to the same data, but as I understand it you currently plot based on fandc[0] and want to go through all the other rectangles fandc[1], fandc[2], etc. You could then try:
for i in range(len(fandc)):
    draw.rectangle(((float(fandc[i]['x'])-float(fandc[i]['r']), float(fandc[i]['y'])-float(fandc[i]['r'])), (float(fandc[i]['x'])+float(fandc[i]['r']), float(fandc[i]['y'])+float(fandc[i]['r']))), fill=(255,0,0,55))
Note how the fixed index 0 is replaced with the loop index i.
If you are struggling to get for loops to work, it is probably wise to do an online tutorial on them and practice with simpler code. See https://www.w3schools.com/python/python_for_loops.asp for more info.
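The long draw.rectangle line gets easier to read if the corner arithmetic moves into a small helper and the loop runs over the entries directly instead of over indices. This is only a sketch: square_around is a hypothetical name, and the two sample entries are made up to stand in for the rows collected in fandc.

```python
def square_around(x, y, r):
    """Corners of the square enclosing a circle with centre (x, y) and radius r.

    The values may be strings, as produced by re.findall in the question.
    """
    x, y, r = float(x), float(y), float(r)
    return ((x - r, y - r), (x + r, y + r))


# Made-up entries in the same shape as the dictionaries appended to fandc.
fandc = [
    {'filename': 'a.jpg', 'x': '100', 'y': '200.5', 'r': '25'},
    {'filename': 'b.jpg', 'x': '40', 'y': '60', 'r': '10'},
]

boxes = [square_around(entry['x'], entry['y'], entry['r']) for entry in fandc]
for entry, box in zip(fandc, boxes):
    print(entry['filename'], box)
```

Inside the real loop, each box can be passed straight to draw.rectangle(box, fill=(255, 0, 0, 55)). Since every row carries its own filename, the image presumably also has to be opened and saved once per row rather than once overall.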

Cannot align text with a line drawn on an image

I'm trying to do some image manipulation with the python library Pillow (fork of PIL) and am coming across a weird problem. For some reason, when I try to draw a line and draw some text at the same y coordinate, they're not matching up. The text is a bit below the line, yet I have both graphics starting at the same point. Has anyone had this problem before and/or know how to solve it? Here's the code I'm using:
image = Image.open("../path_to_image/image.jpg")
draw = ImageDraw.Draw(image)
font = ImageFont.truetype("../fonts/Arial Bold.ttf", 180)
draw.line((0,2400, 500,2400), fill="#FFF", width=1)
draw.text((0, 2400), "Test Text", font=font)
image.save(os.path.join(root, "test1.jpg"), "JPEG", quality=100)
return

I get something similar (with sizes 10 times smaller):
This is happening because the (x,y) coordinates given to ImageDraw.text() are the top left corner of the text:
PIL.ImageDraw.Draw.text(xy, text, fill=None, font=None, anchor=None)
Draws the string at the given position.
Parameters:
xy – Top left corner of the text.
text – Text to be drawn.
font – An ImageFont instance.
fill – Color to use for the text.
This is confirmed in the code: the text is turned into a bitmap and then drawn at xy.

For those with a similar problem, I ended up creating a helper function that increases the font size step by step until font.getsize(text)[1] returns the desired height. Here's a snippet:
def adjust_font_size_to_line_height(font_location, desired_point_size, text):
    adjusted_points = 1
    while True:
        font = ImageFont.truetype(font_location, adjusted_points)
        height = font.getsize(text)[1]
        # "<" rather than "!=" so the loop also ends if no size matches exactly
        if height < desired_point_size:
            adjusted_points += 1
        else:
            break
    return adjusted_points

Cropping pages of a .pdf file

I was wondering if anyone had any experience in working programmatically with .pdf files. I have a .pdf file and I need to crop every page down to a certain size.
After a quick Google search I found the pyPdf library for python but my experiments with it failed. When I changed the cropBox and trimBox attributes on a page object the results were not what I had expected and appeared to be quite random.
Has anyone had any experience with this? Code examples would be much appreciated, preferably in Python.

pyPdf does what I expect in this area. Using the following script:
#!/usr/bin/python
#
from pyPdf import PdfFileWriter, PdfFileReader
with open("in.pdf", "rb") as in_f:
    input1 = PdfFileReader(in_f)
    output = PdfFileWriter()
    numPages = input1.getNumPages()
    print "document has %s pages." % numPages
    for i in range(numPages):
        page = input1.getPage(i)
        print page.mediaBox.getUpperRight_x(), page.mediaBox.getUpperRight_y()
        page.trimBox.lowerLeft = (25, 25)
        page.trimBox.upperRight = (225, 225)
        page.cropBox.lowerLeft = (50, 50)
        page.cropBox.upperRight = (200, 200)
        output.addPage(page)
    with open("out.pdf", "wb") as out_f:
        output.write(out_f)
The resulting document has a trim box that is 200x200 points and starts at 25,25 points inside the media box.
The crop box is 25 points inside the trim box.
Here is how my sample document looks in Acrobat Professional after processing with the above code:
This document will appear blank when loaded in Acrobat Reader.

Use this to get the dimensions of the PDF page:
from PyPDF2 import PdfWriter, PdfReader, PdfMerger
reader = PdfReader("/Users/user.name/Downloads/sample.pdf")
page = reader.pages[0]
print(page.cropbox.lower_left)
print(page.cropbox.lower_right)
print(page.cropbox.upper_left)
print(page.cropbox.upper_right)
After this, take the page object and apply the crop by setting the new corner coordinates:
page.mediabox.lower_right = (lower_right_new_x_coordinate, lower_right_new_y_coordinate)
page.mediabox.lower_left = (lower_left_new_x_coordinate, lower_left_new_y_coordinate)
page.mediabox.upper_right = (upper_right_new_x_coordinate, upper_right_new_y_coordinate)
page.mediabox.upper_left = (upper_left_new_x_coordinate, upper_left_new_y_coordinate)
# for example, my custom coordinates:
# page.mediabox.lower_right = (611, 500)
# page.mediabox.lower_left = (0, 500)
# page.mediabox.upper_right = (611, 700)
# page.mediabox.upper_left = (0, 700)

How do I know the coordinates to crop?
Thanks for all answers above.
Step 1. Run the following code to get (x1, y1).
from PyPDF2 import PdfWriter, PdfReader
reader = PdfReader("test.pdf")
page = reader.pages[0]
print(page.cropbox.upper_right)
Step 2. View the PDF file in full-screen mode.
Step 3. Capture the screen as an image file screen.jpg.
Step 4. Open screen.jpg in MS Paint or GIMP. These applications show the coordinates of the cursor.
Step 5. Note the following coordinates: (x2, y2), (x3, y3), (x4, y4) and (x5, y5), where (x4, y4) and (x5, y5) determine the rectangle you want to crop.
Step 6. Compute page.cropbox.upper_left and page.cropbox.lower_right with the following formulas. Here is a tool for calculating.
page.cropbox.upper_left = (x1*(x4-x2)/(x3-x2),(1-y4/y3)*y1)
page.cropbox.lower_right = (x1*(x5-x2)/(x3-x2),(1-y5/y3)*y1)
Step 7. Run the following code to crop the pdf file.
from PyPDF2 import PdfWriter, PdfReader
reader = PdfReader('test.pdf')
writer = PdfWriter()
for page in reader.pages:
    page.cropbox.upper_left = (100,200)
    page.cropbox.lower_right = (300,400)
    writer.add_page(page)
with open('result.pdf','wb') as fp:
    writer.write(fp)
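The formulas from Step 6 can be wrapped in a small function so the arithmetic does not have to be done by hand. This is a sketch under my reading of the steps, which the answer does not spell out: (x1, y1) is the page size printed in Step 1, x2 and x3 are the left and right edges of the page in the screenshot, and y3 is the height of the page in the screenshot. The function name, parameter names, and all numbers are made up.

```python
def screen_to_pdf(x, y, page_width, page_height, page_left, page_right, screen_height):
    """Map a screenshot pixel (x, y) to PDF coordinates using the Step 6 formulas.

    page_width, page_height  correspond to (x1, y1)
    page_left, page_right    correspond to x2 and x3
    screen_height            corresponds to y3
    """
    pdf_x = page_width * (x - page_left) / (page_right - page_left)
    # Screen y grows downward, PDF y grows upward, hence the "1 - ..." flip.
    pdf_y = (1 - y / screen_height) * page_height
    return pdf_x, pdf_y


# Made-up measurements: a 612 x 792 pt page shown between x=400 and x=1000
# on a screenshot that is 1080 pixels high.
upper_left = screen_to_pdf(550, 270, 612, 792, 400, 1000, 1080)
lower_right = screen_to_pdf(850, 810, 612, 792, 400, 1000, 1080)
print(upper_left, lower_right)
```

The two resulting tuples are what would be assigned to page.cropbox.upper_left and page.cropbox.lower_right in Step 7.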

You are probably looking for a free solution, but if you have money to spend, PDFlib is a fabulous library. It has never disappointed me.

You can convert the PDF to PostScript (pdftops or pdf2ps) and then use text processing on the PostScript file. After that you can convert the output back to PDF (ps2pdf).
This works nicely if the PDFs you want to process are all generated by the same application and are somewhat similar. If they come from different sources, it is usually too hard to process the PostScript files because the structure varies too much. But even then you might be able to fix page sizes and the like with a few regular expressions.

Acrobat Javascript API has a setPageBoxes method, but Adobe doesn't provide any Python code samples. Only C++, C# and VB.

Cropping pages of a .pdf file
from PIL import Image
def ImageCrop():
    img = Image.open("page_1.jpg")
    left = 90
    top = 580
    right = 1600
    bottom = 2000
    img_res = img.crop((left, top, right, bottom))
    # outfile4 (the output path) must be defined beforehand; save() opens the file itself
    img_res.save(outfile4, 'JPEG')
ImageCrop()
