Cropping pages of a .pdf file - python

I was wondering if anyone had any experience in working programmatically with .pdf files. I have a .pdf file and I need to crop every page down to a certain size.
After a quick Google search I found the pyPdf library for python but my experiments with it failed. When I changed the cropBox and trimBox attributes on a page object the results were not what I had expected and appeared to be quite random.
Has anyone had any experience with this? Code examples would be much appreciated, preferably in Python.

pyPdf does what I expect in this area. Using the following script:
#!/usr/bin/python
#
from pyPdf import PdfFileWriter, PdfFileReader
with open("in.pdf", "rb") as in_f:
    input1 = PdfFileReader(in_f)
    output = PdfFileWriter()
    numPages = input1.getNumPages()
    print "document has %s pages." % numPages
    for i in range(numPages):
        page = input1.getPage(i)
        print page.mediaBox.getUpperRight_x(), page.mediaBox.getUpperRight_y()
        page.trimBox.lowerLeft = (25, 25)
        page.trimBox.upperRight = (225, 225)
        page.cropBox.lowerLeft = (50, 50)
        page.cropBox.upperRight = (200, 200)
        output.addPage(page)
    with open("out.pdf", "wb") as out_f:
        output.write(out_f)
The resulting document has a trim box that is 200x200 points and starts at 25,25 points inside the media box.
The crop box is 25 points inside the trim box.
Here is how my sample document looks in Acrobat Professional after processing with the above code: [screenshot not included]
This document will appear blank when loaded in Acrobat Reader.
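The box arithmetic in this answer can be checked without pyPdf. The following is only a sketch in plain Python: `inset_box` and `box_size` are hypothetical helpers (they are not part of pyPdf), and boxes are written as `(llx, lly, urx, ury)` tuples in points.

```python
def inset_box(box, margin):
    """Shrink a (llx, lly, urx, ury) box by `margin` points on every side."""
    llx, lly, urx, ury = box
    if 2 * margin >= min(urx - llx, ury - lly):
        raise ValueError("margin leaves no area")
    return (llx + margin, lly + margin, urx - margin, ury - margin)


def box_size(box):
    """Return (width, height) of a (llx, lly, urx, ury) box."""
    llx, lly, urx, ury = box
    return (urx - llx, ury - lly)


trim_box = (25, 25, 225, 225)       # trim box corners used in the script
crop_box = inset_box(trim_box, 25)  # crop box sits 25 pt inside the trim box

print(box_size(trim_box))
print(crop_box, box_size(crop_box))
```

With the numbers from the script, the computed crop box corners are the same ones assigned to `page.cropBox`.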

Use this to get the dimensions of a page in the PDF:
from PyPDF2 import PdfReader
reader = PdfReader("/Users/user.name/Downloads/sample.pdf")
page = reader.pages[0]
print(page.cropbox.lower_left)
print(page.cropbox.lower_right)
print(page.cropbox.upper_left)
print(page.cropbox.upper_right)
After this, set new corner coordinates on the same page object to crop it:
page.mediabox.lower_right = (lower_right_new_x_coordinate, lower_right_new_y_coordinate)
page.mediabox.lower_left = (lower_left_new_x_coordinate, lower_left_new_y_coordinate)
page.mediabox.upper_right = (upper_right_new_x_coordinate, upper_right_new_y_coordinate)
page.mediabox.upper_left = (upper_left_new_x_coordinate, upper_left_new_y_coordinate)
# for example, my custom coordinates:
# page.mediabox.lower_right = (611, 500)
# page.mediabox.lower_left = (0, 500)
# page.mediabox.upper_right = (611, 700)
# page.mediabox.upper_left = (0, 700)

How do I know the coordinates to crop?
Thanks for all answers above.
Step 1. Run the following code to get (x1, y1).
from PyPDF2 import PdfWriter, PdfReader
reader = PdfReader("test.pdf")
page = reader.pages[0]
print(page.cropbox.upper_right)
Step 2. View the pdf file in full screen mode.
Step 3. Capture the screen as an image file screen.jpg.
Step 4. Open screen.jpg by MS paint or GIMP. These applications show the coordinate of the cursor.
Step 5. Remember the following coordinates, (x2, y2), (x3, y3), (x4, y4) and (x5, y5), where (x4, y4) and (x5, y5) determine the rectangle you want to crop.
Step 6. Get page.cropbox.upper_left and page.cropbox.lower_right by the following formulas. Here is a tool for calculating.
page.cropbox.upper_left = (x1*(x4-x2)/(x3-x2),(1-y4/y3)*y1)
page.cropbox.lower_right = (x1*(x5-x2)/(x3-x2),(1-y5/y3)*y1)
Step 7. Run the following code to crop the pdf file.
from PyPDF2 import PdfWriter, PdfReader
reader = PdfReader('test.pdf')
writer = PdfWriter()
for page in reader.pages:
    page.cropbox.upper_left = (100,200)
    page.cropbox.lower_right = (300,400)
    writer.add_page(page)
with open('result.pdf','wb') as fp:
    writer.write(fp)
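The Step 6 formulas can be wrapped in a small helper so the arithmetic is not done by hand. This is a sketch only. The function name is hypothetical, and the meaning of the screenshot coordinates is my reading of the formulas rather than something the steps state: `x2` and `x3` are taken as the screenshot x positions of the page's left and right edges, and `y3` as the screenshot y position of the page's bottom edge, with the page top at row 0.

```python
def screen_to_pdf_point(x, y, x1, y1, x2, x3, y3):
    """Apply the Step 6 formulas to one screenshot pixel (x, y).

    x1, y1: page.cropbox.upper_right in points (Step 1).
    x2, x3: assumed screenshot x positions of the page's left/right edges.
    y3:     assumed screenshot y position of the page's bottom edge.
    """
    pdf_x = x1 * (x - x2) / (x3 - x2)
    pdf_y = (1 - y / y3) * y1
    return (pdf_x, pdf_y)


# Hypothetical numbers: a 600 x 800 pt page shown between x=100 and x=850
# on a screenshot in which the page is 1000 px tall.
x1, y1 = 600, 800
x2, x3, y3 = 100, 850, 1000

upper_left = screen_to_pdf_point(250, 250, x1, y1, x2, x3, y3)   # (x4, y4)
lower_right = screen_to_pdf_point(700, 750, x1, y1, x2, x3, y3)  # (x5, y5)
print(upper_left, lower_right)
```

The two results are what would be assigned to `page.cropbox.upper_left` and `page.cropbox.lower_right` in Step 7.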

You are probably looking for a free solution, but if you have money to spend, PDFlib is a fabulous library. It has never disappointed me.

You can convert the PDF to PostScript (for example with pdf2ps or pdftops) and then use text processing on the PostScript file. After that you can convert the output back to PDF (for example with ps2pdf or pstopdf).
This works nicely if the PDFs you want to process are all generated by the same application and are somewhat similar. If they come from different sources, it is usually too hard to process the PostScript files because the structure varies too much. But even then you might be able to fix page sizes and the like with a few regular expressions.

Acrobat Javascript API has a setPageBoxes method, but Adobe doesn't provide any Python code samples. Only C++, C# and VB.

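Cropping a page of a .pdf file after it has been rendered to an image (here page_1.jpg):
from PIL import Image
def ImageCrop():
    img = Image.open("page_1.jpg")
    # crop box in pixels: (left, top, right, bottom)
    left = 90
    top = 580
    right = 1600
    bottom = 2000
    img_res = img.crop((left, top, right, bottom))
    # outfile4 was not defined in the original snippet; this name is a placeholder
    outfile4 = "page_1_cropped.jpg"
    # save() writes the file itself, so no separate open() is needed
    img_res.save(outfile4, 'JPEG')
ImageCrop()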

Related

Having Problems with extracting only a Rect object in PyMuPDF

I tried the solution from this thread here
Read specific region from PDF
Sadly the following example from the thread by user Zach Young doesn't work for me.
import os.path
import fitz
from fitz import Document, Page, Rect
# For visualizing the rects that PyMuPDF uses compared to what you see in the PDF
VISUALIZE = True
input_path = "test.pdf"
doc: Document = fitz.open(input_path)
for i in range(len(doc)):
    page: Page = doc[i]
    page.clean_contents()  # https://pymupdf.readthedocs.io/en/latest/faq.html#misplaced-item-insertions-on-pdf-pages
    # Hard-code the rect you need
    rect = Rect(0, 0, 100, 100)
    if VISUALIZE:
        # Draw a red box to visualize the rect's area (text)
        page.draw_rect(rect, width=1.5, color=(1, 0, 0))
    text = page.get_textbox(rect)
    print(text)
if VISUALIZE:
    head, tail = os.path.split(input_path)
    viz_name = os.path.join(head, "viz_" + tail)
    doc.save(viz_name)
but if I set my respective values for the Rect object (which do seem reasonable) and issue
print(rect.is_empty)
it outputs True.
Also it doesn't draw the rectangle as it obviously should. There is obviously no output from
text = page.get_textbox(rect)
But if I just issue
text = page.get_text()
that gives me some correct output.
However, I wonder why it reports the rect as empty, because I really need to extract the text from only a certain area.
Thanks
Nothing is delivered in the following case, because none of the characters is fully contained in the box:
Here is the output for another page:
#ifnde
#defin
/*
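The rule stated in this answer, that only characters lying fully inside the rectangle are returned, can be illustrated without PyMuPDF. The sketch below is not PyMuPDF's implementation: `fully_inside` and `text_in_rect` are hypothetical helpers, and the character boxes are made-up values chosen so that the rectangle cuts through the last character of `#ifndef`.

```python
def fully_inside(char_box, rect):
    """True if the character box lies completely within rect.

    Both arguments are (x0, y0, x1, y1) tuples.
    """
    cx0, cy0, cx1, cy1 = char_box
    rx0, ry0, rx1, ry1 = rect
    return rx0 <= cx0 and ry0 <= cy0 and cx1 <= rx1 and cy1 <= ry1


def text_in_rect(chars, rect):
    """Join the characters whose boxes are fully contained in rect."""
    return "".join(c for c, box in chars if fully_inside(box, rect))


# Hypothetical layout: each character of "#ifndef" is 6 pt wide and 10 pt tall.
chars = [(c, (6 * i, 0, 6 * i + 6, 10)) for i, c in enumerate("#ifndef")]

clipped = text_in_rect(chars, (0, 0, 39, 10))  # rect ends inside the final "f"
nothing = text_in_rect(chars, (1, 0, 5, 10))   # rect narrower than one character
print(repr(clipped), repr(nothing))
```

Under this rule a rectangle that is slightly too narrow drops the last character of each line, and a rectangle smaller than any single character yields no text at all.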

How to detect colored blocks in a PDF file with python (pdfminer, minecart, tabula...)

I am trying to extract quite a few tables from a PDF file. These tables are sort of conveniently "highlighted" with different colors, which makes it easy for eyes to catch (see the example screenshot).
I think it would be good to detect the position/coordinates of those colored blocks, and use the coordinates to extract tables.
I have figured out the table extraction part (using tabula-py), so it is the first step that is stopping me. From what I gathered, minecart is the best tool for colors and shapes in PDF files, short of full-scale image processing with OpenCV. But I have had no luck detecting the coordinates of the colored boxes/blocks.
Would appreciate any help!!
I think I got a solution:
import minecart
pdffile = open(fn, 'rb')
doc = minecart.Document(pdffile)
page = doc.get_page(page_num) # page_num is 0-based
for shape in page.shapes.iter_in_bbox((0, 0, 612, 792 )):
    if shape.fill:
        shape_bbox = shape.get_bbox()
        shape_color = shape.fill.color.as_rgb()
        print(shape_bbox, shape_color)
I would then need to filter the color or the shape size...
My earlier failure was due to having used a wrong page number :(
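The remaining filtering step, by color or by shape size, could look roughly like this. It is a sketch in plain Python, not minecart code: `filter_shapes` is a hypothetical helper, and it assumes the `(shape_bbox, shape_color)` pairs printed by the loop have been collected into a list, with boxes as `(x0, y0, x1, y1)` and colors as RGB triples in the 0 to 1 range.

```python
def filter_shapes(shapes, target_rgb, tol=0.05, min_width=20, min_height=20):
    """Keep the bounding boxes that are large enough and close to target_rgb.

    shapes: iterable of (bbox, rgb) pairs, bbox = (x0, y0, x1, y1).
    tol:    maximum allowed difference per color channel.
    """
    hits = []
    for bbox, rgb in shapes:
        x0, y0, x1, y1 = bbox
        if (x1 - x0) < min_width or (y1 - y0) < min_height:
            continue  # too small to be a table background
        if all(abs(a - b) <= tol for a, b in zip(rgb, target_rgb)):
            hits.append(bbox)
    return hits


# Hypothetical data in the form printed by the loop: (bbox, (r, g, b)).
shapes = [
    ((50, 600, 550, 700), (0.8, 0.9, 1.0)),  # large light-blue block
    ((50, 580, 550, 581), (0.8, 0.9, 1.0)),  # thin rule, filtered out by size
    ((50, 300, 550, 400), (1.0, 1.0, 0.6)),  # large yellow block
]
blue_blocks = filter_shapes(shapes, target_rgb=(0.8, 0.9, 1.0))
print(blue_blocks)
```

The surviving boxes could then be passed on as the extraction areas for the table step.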
PyMuPDF lets you extract so-called "line art": the vector drawings on a page.
This is a list of dictionaries of "paths" (as PDF calls interconnected drawings) from which you can sub-select ones of interest for you.
E.g. the following identifies drawings that represent filled rectangles, not too small:
import fitz  # PyMuPDF
doc = fitz.open("input.pdf")  # placeholder file name
page = doc[0]  # load some page (here page 0)
paths = page.get_drawings()  # extract all vector graphics
filled_rects = []  # filled rectangles without border land here
for path in paths:
    if path["type"] != "f":  # only consider paths with a fill color
        continue
    rect = path["rect"]
    if rect.width < 20 or rect.height < 20:  # only consider sizable rects
        continue
    filled_rects.append(rect)  # hopefully an area coloring a table
# make a visible border around the hits to see success:
for rect in filled_rects:
    page.draw_rect(rect, color=fitz.pdfcolor["red"])
doc.save("debug.pdf")

Writing Images to a PDF through a for loop, images are getting out of order

The script I'm showing is a little simpler than what I'm actually doing, but it has the same problem as the original.
What I'm trying to do is take a screenshot of a webpage, go to the next page, take another screenshot, and then compile them into a PDF. I was trying to avoid having to write each image to my hard drive, write it to the PDF, then delete the image from the hard drive.
Now the images are sometimes out of order, and I'm assuming it has to do with not being able to write the image to the PDF fast enough, or the order of operations being screwed up. I'm fairly new, so I'm not sure. I've tried changing the wait time. I also tried saving all the images to a dictionary and looping through the dictionary, which made it worse: it kept repeating the same 4 images. I just know I'm not doing it right. Any suggestions?
from PIL import ImageGrab, Image
from fpdf import FPDF
from win32api import GetSystemMetrics
import io
import time
import keyboard
pdf_width = GetSystemMetrics(0)
pdf_height = GetSystemMetrics(1)
t = 10
pdf = FPDF(unit = "pt", format = [pdf_width,pdf_height])
for x in range(0, t):
    img = ImageGrab.grab()
    print(x)
    mem_file = io.BytesIO()
    img.save(mem_file, 'JPEG')
    data = mem_file.getvalue()
    my_img = Image.open(io.BytesIO(data))
    pdf.add_page()
    pdf.image(my_img,0,0)
    keyboard.press_and_release('right')
    time.sleep(2)
    if x == (t-1):
        pdf.output('test.pdf', 'F')
[example picture not included]
Pics marked 1-3 are in order; pic 4 is the same as pic 2, which is not correct.
Figures: after hours of trying things and reading, I made a post, walked away for a bit, and figured it out within 15 minutes when I came back. I tried to use a dict and failed this morning, so I didn't think a list would work.
# `string` is a list created earlier in the full script (e.g. string = [])
t = int(page_count)
time.sleep(1)
pdf = FPDF(unit = "pt", format = [pdf_width,pdf_height])
for x in range(0, t):
    string.append(ImageGrab.grab())
    print(x)
    mem_file = io.BytesIO()
    string[x].save(mem_file, 'JPEG')
    data = mem_file.getvalue()
    string[x] = Image.open(io.BytesIO(data))
    pdf.add_page()
    pdf.image((string[x]),0,0)
    keyboard.press_and_release('right')
    ispage_Loading(driver)
    time.sleep(WAIT)
    if x == (t-1):
        pdf.output(dir_path+'\\'+book_name+'.pdf', 'F')
        winsound.PlaySound("*", winsound.SND_ALIAS)
        explore(dir_path)

How to save output of darknet YOLOv4 video in a txt file for each frame?

I am using darknet to detect objects with YOLOv4 on my custom made dataset. For this detection on videos I use:
./darknet detector demo data/obj.data yolo-obj.cfg yolo-obj_best.weights -ext_output video.mp4 -out-filename video_results.mp4
This gives me the video with the bounding boxes drawn for every detection. However, I want to create a .txt (or .csv) file that lists the prediction(s) for each frame number.
I did find this answer, but this gives the output in a json file and I need a .txt or .csv file. I am not so familiar with C so I find it hard to modify this answer into the format I need.
There is already an explanation of how to use the command line, including how to save results in .txt format, at this link:
https://github.com/AlexeyAB/darknet#how-to-use-on-the-command-line
To save time, here is the part that might be helpful:
To process a list of images data/train.txt and save results of detection to result.txt use:
darknet.exe detector test cfg/coco.data cfg/yolov4.cfg yolov4.weights -dont_show -ext_output < data/train.txt > result.txt
This might be late, but maybe it is helpful to others.
I followed the suggestion by Rafael and wrote some code to convert the JSON to CSV. I'll put it here in case anyone wants to use it. This is for the case in which a video was analyzed, so each "image" is a frame in a video.
import json
import csv
import os
# width and height of the video in pixels
WIDTH = 1920
HEIGHT = 1080
# open() does not expand '~', so expand it explicitly
json_path = os.path.expanduser('~/detection_results.json')
csv_path = os.path.expanduser('~/detection_results.csv')
with open(json_path, encoding='latin-1') as json_file:
    data = json.load(json_file)
# open csv file (the csv module expects newline='')
csv_file_to_make = open(csv_path, 'w', newline='')
csv_file = csv.writer(csv_file_to_make)
# write the header
# NB x and y values in the JSON are relative; they are scaled to pixels below
csv_file.writerow(['Frame ID',
                   'class',
                   'x_center',
                   'y_center',
                   'bb_width',
                   'bb_height',
                   'confidence'])
for frame in data:
    frame_id = frame['frame_id']
    # `class` is a reserved word in Python, so the label is kept in `instrument`
    instrument = ""
    center_x = ""
    center_y = ""
    bb_width = ""
    bb_height = ""
    confidence = ""
    if frame['objects'] == []:
        # no detections: write the frame number with empty fields
        csv_file.writerow([frame_id,
                           instrument,
                           center_x,
                           center_y,
                           bb_width,
                           bb_height,
                           confidence
                           ])
    else:
        for single_detection in frame['objects']:
            instrument = single_detection['name']
            center_x = WIDTH*single_detection['relative_coordinates']['center_x']
            center_y = HEIGHT*single_detection['relative_coordinates']['center_y']
            bb_width = WIDTH*single_detection['relative_coordinates']['width']
            bb_height = HEIGHT*single_detection['relative_coordinates']['height']
            confidence = single_detection['confidence']
            csv_file.writerow([frame_id,
                               instrument,
                               center_x,
                               center_y,
                               bb_width,
                               bb_height,
                               confidence
                               ])
csv_file_to_make.close()
Hope this helps! If you see a solution to optimize this code that's also welcome of course :)
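On the request for a tidier version: one possible shape, offered as a sketch rather than a tested replacement, is to move the row building into a generator and hand it to `csv.writer.writerows`. The function name `detection_rows` and the two-frame sample are made up for illustration; the keys (`frame_id`, `objects`, `name`, `relative_coordinates`, `confidence`) are the ones the script reads. The sample is written to an in-memory buffer so it runs without any files.

```python
import csv
import io

HEADER = ['Frame ID', 'class', 'x_center', 'y_center',
          'bb_width', 'bb_height', 'confidence']


def detection_rows(frames, width, height):
    """Yield one row per detection, or one blank row for a frame without any."""
    for frame in frames:
        objects = frame['objects']
        if not objects:
            yield [frame['frame_id'], '', '', '', '', '', '']
            continue
        for det in objects:
            rel = det['relative_coordinates']
            yield [frame['frame_id'], det['name'],
                   width * rel['center_x'], height * rel['center_y'],
                   width * rel['width'], height * rel['height'],
                   det['confidence']]


# Hypothetical two-frame sample in the structure the script reads.
sample = [
    {'frame_id': 1, 'objects': []},
    {'frame_id': 2, 'objects': [
        {'name': 'scissors', 'confidence': 0.5,
         'relative_coordinates': {'center_x': 0.5, 'center_y': 0.5,
                                  'width': 0.25, 'height': 0.25}}]},
]

buffer = io.StringIO()  # stands in for the real CSV file
writer = csv.writer(buffer)
writer.writerow(HEADER)
writer.writerows(detection_rows(sample, 1920, 1080))
print(buffer.getvalue())
```

To write a real file, the buffer would be replaced by `open(csv_path, 'w', newline='')` inside a `with` block, which also takes care of closing it.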

Get DPI information of PDF image Python

I have a PDF file with an image embedded in it. How can I get the DPI information of that particular image using Python?
I tried using "pdfimages" from poppler-utils; it gives me the height and width in pixels.
But how can I get the DPI of the image from that?
Like the PostScript or EPS formats, a PDF file has no resolution of its own because it is a vector format. All you can do is retrieve the page dimensions in pt:
import io
from PyPDF2 import PdfFileReader
path = "input.pdf"  # placeholder: path to your PDF file
with io.open(path, mode="rb") as f:
    input_pdf = PdfFileReader(f)
    media_box = input_pdf.getPage(0).mediaBox
min_pt = media_box.lowerLeft
max_pt = media_box.upperRight
# page width and height in points (1 pt = 1/72 inch)
pdf_width = max_pt[0] - min_pt[0]
pdf_height = max_pt[1] - min_pt[1]
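Given the pixel size reported by pdfimages and the size in points at which the image is drawn, an effective DPI can be estimated, since one point is 1/72 inch. The sketch below uses a hypothetical helper and made-up numbers, and it assumes the image is drawn across the whole page, so the media box size stands in for the drawn size. If the image covers only part of the page, the result will be too low.

```python
def effective_dpi(pixels, points):
    """Pixels per inch for an image `pixels` long drawn over `points` (1 pt = 1/72 in)."""
    return pixels / (points / 72.0)


# Hypothetical numbers: a 2550 x 3300 px image drawn across a 612 x 792 pt page.
dpi_x = effective_dpi(2550, 612)
dpi_y = effective_dpi(3300, 792)
print(dpi_x, dpi_y)
```

The horizontal and vertical values can differ if the image was stretched when it was placed on the page.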
