Fetching, Rendering and Positioning images in a PDF file using Python - python

I have a folder containing multiple .bmp images that I want to place in a PDF file in a structured manner. Can anyone assist?
Say I have 100 .bmp images in a folder, named image1, image2, ... image100.
I want to display these images in the PDF file as described below, with, say, 5 lines per PDF page:
-image1 and image2 placed side by side in line 1
-image3 and image4 placed side by side in line 2
-image5 and image6 placed side by side in line 3
...
...
-image99 and image100 placed side by side in line 50
The code should:
Fetch the images from the folder
Set the size and resolution of these images as required
Position the images at specific locations in the PDF

I looked around for a solution and found the answer in the FPDF library.
I am sharing the very rough solution I cobbled together.
## install the FPDF library using: pip install fpdf
from fpdf import FPDF
def createPDF():
    fpdf = FPDF()
    ## start a new page in the PDF document
    fpdf.add_page()
    ## set the font for the text to be added
    fpdf.set_font("Arial", size=10)
    ## add text to the PDF document
    fpdf.text(22, 10, txt="MASTER")
    fpdf.text(57, 10, txt="SAMPLE")
    ## add images to the PDF document
    fpdf.image("CMYK-M.jpg", 15, 15, w=50)
    fpdf.image("CMYK-S.jpg", 50, 15, w=50)
    ## save the PDF document
    fpdf.output("REPORT001.pdf")
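The layout asked for in the question (two images per line, five lines per page) comes down to a little arithmetic on the image index. The following is a minimal sketch, not a definitive implementation: the helper names natural_key, grid_position and build_pdf, the A4 page size, and the margin, gap and cell sizes in millimetres are all assumptions chosen for illustration. It also assumes the installed FPDF version can read the image format; .bmp files may need converting first, for example with Pillow.

```python
import os
import re

# Assumed layout values (millimetres, A4 portrait); adjust as required.
COLS, ROWS = 2, 5          # 2 images per line, 5 lines per page
MARGIN, GAP = 15.0, 5.0    # page margin and spacing between cells
CELL_W, CELL_H = 87.5, 49.0


def natural_key(name):
    """Sort key that orders image2 before image10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def grid_position(index):
    """Return (page, x, y) for the image with the given zero-based index."""
    page, slot = divmod(index, COLS * ROWS)
    row, col = divmod(slot, COLS)
    x = MARGIN + col * (CELL_W + GAP)
    y = MARGIN + row * (CELL_H + GAP)
    return page, x, y


def build_pdf(folder, output_path, extension=".bmp"):
    """Place every matching image in the folder on a 2 x 5 grid per page."""
    from fpdf import FPDF  # imported here so the helpers above work without it

    names = sorted((n for n in os.listdir(folder)
                    if n.lower().endswith(extension)), key=natural_key)
    pdf = FPDF()  # defaults to A4 portrait with millimetre units
    current_page = -1
    for index, name in enumerate(names):
        page, x, y = grid_position(index)
        if page != current_page:
            pdf.add_page()
            current_page = page
        # Only the width is fixed, so the height follows the aspect ratio.
        pdf.image(os.path.join(folder, name), x, y, w=CELL_W)
    pdf.output(output_path)
```

Calling build_pdf("images", "REPORT001.pdf") would then place image1 and image2 on line 1, image3 and image4 on line 2, and so on. Because only the width is fixed, tall images can run into the next line; passing h=CELL_H to image() as well forces a fixed cell height at the cost of distorting the aspect ratio.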

Related

pypdf gives output with incorrect PDF format

I am using the following code to resize pages in a PDF:
from pypdf import PdfReader, PdfWriter, Transformation, PageObject, PaperSize
from pypdf.generic import RectangleObject
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    A4_w = PaperSize.A4.width
    A4_h = PaperSize.A4.height
    # resize page to fit *inside* A4
    h = float(page.mediabox.height)
    w = float(page.mediabox.width)
    scale_factor = min(A4_h/h, A4_w/w)
    transform = Transformation().scale(scale_factor,scale_factor).translate(0, A4_h/2 - h*scale_factor/2)
    page.add_transformation(transform)
    page.cropbox = RectangleObject((0, 0, A4_w, A4_h))
    # merge the pages to fit inside A4
    # prepare A4 blank page
    page_A4 = PageObject.create_blank_page(width = A4_w, height = A4_h)
    page.mediabox = page_A4.mediabox
    page_A4.merge_page(page)
    writer.add_page(page_A4)
writer.write('output.pdf')
Source: https://stackoverflow.com/a/75274841/11501160
While this code handles the resizing itself, I have found that most input files work fine but some do not.
I am providing download links to input.pdf and output.pdf files for testing and review. The output file is completely different from the input file: the images are missing, the background colour is different, and even the pure text on the first page has only its first line visible.
What is interesting is that these differences are only seen when I open the output PDF in Adobe Acrobat, or look at the physically printed pages.
The PDF looks perfect when I open it in Preview (on macOS) or in my Chrome browser.
The input PDF originated in Preview (on macOS), where I created it by mixing pages from different PDFs and dragging image files into the thumbnails as per these instructions:
https://support.apple.com/en-ca/HT202945
I've never had a problem before while making PDFs like this, and even Adobe Acrobat reads the input PDF properly. Only the output PDF is problematic in Acrobat and on printers.
Is this a bug in pypdf or am I doing something wrong?
How can I get the output PDF to display properly in Adobe Acrobat, on printers, etc.?
This is a genuine bug in pypdf, and the fix is due to be released in the next version.
Refer to:
https://github.com/py-pdf/pypdf/issues/1607
The following is what PyMuPDF has to offer here. The output displays correctly in all PDF readers:
import fitz  # import PyMuPDF
src = fitz.open("input.pdf")
doc = fitz.open()
for i in range(len(src)):
    page = doc.new_page()  # this is A4 portrait by default
    page.show_pdf_page(page.rect, src, i)  # scaling will happen automatically
doc.save("fitz-output.pdf", garbage=3, deflate=True)
The show_pdf_page() method used above supports many more options, such as selecting sub-rectangles from the source page, rotating it by arbitrary angles, and of course freely selecting the target page's sub-rectangle to receive the content.
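For reference, the scale-and-shift arithmetic behind fitting a page inside A4, which the pypdf code in the question performs by hand and which show_pdf_page() handles automatically, can be sketched in plain Python. This is an illustrative sketch only: the function name fit_inside is hypothetical, the A4 size of 595 x 842 points is an assumed default, and unlike the code in the question it centres the page horizontally as well as vertically.

```python
def fit_inside(width, height, target_w=595.0, target_h=842.0):
    """Return (scale, tx, ty) that fits a page inside the target box.

    The page is scaled uniformly so that it fits entirely inside the
    target, then shifted so that it is centred on both axes.
    """
    scale = min(target_w / width, target_h / height)
    tx = (target_w - width * scale) / 2
    ty = (target_h - height * scale) / 2
    return scale, tx, ty


# US Letter (612 x 792 points) placed inside A4
scale, tx, ty = fit_inside(612, 792)
print(round(scale, 4), round(tx, 2), round(ty, 2))
```

In the pypdf code, these values would correspond to Transformation().scale(scale, scale).translate(tx, ty).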

convert jpg file to DXF file using Python

I'm stuck on the problem of converting a jpg file into a dxf file in Python. After the dxf file is created, I want to open it in MS Visio, which is a drawing tool for making various types of diagrams.
I found some Python code that is meant to convert jpg into dxf, but the dxf file it creates is empty: the image from the jpg is not shown when I open the dxf file in any online dxf viewer.
Here is my code below:
import ezdxf
doc = ezdxf.new("R2000")
img_file="n1.jpeg"
my_image_def = doc.add_image_def(filename=img_file, size_in_pixel=(930, 2500))
msp = doc.modelspace()
msp.add_image(
    insert=(2, 1),
    size_in_units=(6.4, 3.6),
    image_def=my_image_def,
    rotation=0)
msp.add_image(
    insert=(4, 5),
    size_in_units=(3.2, 1.8),
    image_def=my_image_def,
    rotation=30)
image_defs = doc.objects.query("IMAGEDEF")
doc.saveas("dxf.dxf")
Input image in jpg format
Could anybody kindly sort this problem out and make the image visible in the dxf file?
Thanks!
The image above is one I uploaded as a test image; I want that image to be converted into a DXF file in Python. Can anybody help?

I have a folder full of PDFs and want to create code that outputs a list of all PDFs that contain the color blue

Like the title says, I have a bunch of PDFs that need to be gone through, and I need a list of the ones that have the color blue in them.
I tried using a snippet of code from a similar post to get a list of colors from one document. My thinking was that I could then loop through all the documents, export the output to Excel and filter for a specific color. However, I can't even get it to work for a single PDF:
#!/usr/bin/env python
# -*- Encoding: UTF-8 -*-
import minecart
colors = set()
with open("F://Prints/0-25162.PDF", "rb") as file:
    document = minecart.Document(file)
    page = document.get_page(1)
    for shape in page.shapes:
        if shape.outline:
            colors.add(shape.outline.color.as_rgb())
for color in colors:
    print(color)
Any help or direction would be appreciated.
I would try rendering each PDF page into PNG or a similar bitmap format, then loading it as a pixel array in Python (using Pillow or similar) and looking for blue pixels. I'm not sure which library is best for the rasterizing, but pdf2image might do the job. Alternatively, you can rasterize with ImageMagick prior to the Python processing.
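The pixel check itself might look like the following sketch. It assumes each page has already been rasterized into (R, G, B) tuples, for instance as read from a Pillow image in RGB mode; the function names and the thresholds that define "blue" are assumptions to tune for your documents.

```python
def is_blue(pixel, min_blue=128, margin=50):
    """Treat a pixel as blue if its blue channel is bright and dominant."""
    r, g, b = pixel[:3]
    return b >= min_blue and b - max(r, g) >= margin


def contains_blue(pixels, min_count=1):
    """Return True once at least min_count blue pixels have been seen."""
    seen = 0
    for pixel in pixels:
        if is_blue(pixel):
            seen += 1
            if seen >= min_count:
                return True
    return False
```

Raising min_count helps to ignore a few stray pixels from anti-aliasing or compression. Looping over the folder and collecting the file names for which contains_blue returns True on any page gives the requested list.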

Extract text from PDF files and preserve the original layout, in Python

I want to extract text from PDF files while maintaining the layout of the text in the PDF, as in the images below. The images show results from [github.com/JonathanLink/PDFLayoutTextStripper].
I tried the code below, but it doesn't maintain the layout. I want to get results exactly as shown in the images using any of the Python libraries such as PyPDF2, PDFPlumber or PDFminer. I tried all of these libraries but didn't get the desired results. I need help extracting the text from the PDF file exactly as shown in the images.
from pdfminer.high_level import extract_text
text = extract_text('test.pdf')
print(text)
You can preserve layout/indentation using the pdftotext package.
import pdftotext
with open("target_file.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
# All pages
for text in pdf:
    print(text)

Extract images from word document using Python

How can I extract images/logos from a Word document using Python and store them in a folder? The following code converts docx to html, but it doesn't extract the images from the html. Any pointers/suggestions will be of great help.
import mammoth
profile_path = <file path>
f = open(profile_path, 'rb')
b = open(profile_html, 'wb')
document = mammoth.convert_to_html(f)
b.write(document.value.encode('utf8'))
f.close()
b.close()
You can use the docx2txt library; it will read your .docx document and export the images to a directory you specify (which must exist).
!pip install docx2txt
import docx2txt
text = docx2txt.process("/path/your_word_doc.docx", '/home/example/img/')
After execution you will have the images in /home/example/img/ and the variable text will have the document text. They would be named image1.png ... imageN.png in order of appearance.
Note: Word document must be in .docx format.
Extract all the images in a docx file using python
1. Using docx2txt
import docx2txt
#extract text
text = docx2txt.process(r"filepath_of_docx")
#extract text and write images in Temporary Image directory
text = docx2txt.process(r"filepath_of_docx",r"Temporary_Image_Directory")
2. Using aspose
import aspose.words as aw
# load the Word document
doc = aw.Document(r"filepath")
# retrieve all shapes
shapes = doc.get_child_nodes(aw.NodeType.SHAPE, True)
imageIndex = 0
# loop through shapes
for shape in shapes:
    shape = shape.as_shape()
    if shape.has_image:
        # set image file's name
        imageFileName = f"Image.ExportImages.{imageIndex}_{aw.FileFormatUtil.image_type_to_extension(shape.image_data.image_type)}"
        # save image
        shape.image_data.save(imageFileName)
        imageIndex += 1
Native, without any library
To extract the source images from the docx (which is a variation on a zip file) without distortion or conversion, shell out to the OS and run:
tar -m -xf DocxWithImages.docx word/media
You will find the source images (JPEG, PNG, WMF or others) extracted into a word/media folder. These are the unadulterated embedded source files, without scaling or cropping.
You may be surprised to find that the visible area is larger than any cropped version used in the docx itself, so be aware that Word does not always crop images as expected (a source of embarrassing redaction failures).
See Alderven's answer at Extract all the images in a docx file using python.
The zipfile module works for more image formats than docx2txt. For example, EMF images are not extracted by docx2txt but can be extracted by zipfile.
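The zipfile approach can be sketched with the standard library only. This is a minimal sketch, assuming the images live under word/media/ inside the .docx archive (the same path used in the tar command earlier); the function name extract_docx_images and the output folder are hypothetical.

```python
import os
import zipfile


def extract_docx_images(docx_path, out_dir):
    """Copy every file under word/media/ in the .docx into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for member in archive.namelist():
            if not member.startswith("word/media/") or member.endswith("/"):
                continue
            # Keep only the base name so nothing is written outside out_dir.
            target = os.path.join(out_dir, os.path.basename(member))
            with archive.open(member) as src, open(target, "wb") as dst:
                dst.write(src.read())
            written.append(target)
    return written
```

For example, extract_docx_images("DocxWithImages.docx", "img") would return the list of files written to the img folder, whatever their format (PNG, JPEG, EMF, WMF and so on), since the bytes are copied unchanged.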
