Python PyMuPDF Fitz insertImage - python

Have been trying to put an image into a PDF file using PyMuPDF / Fitz and everywhere I look on the internet I get the same syntax, but when I use it I'm getting a runtime error.
>>> doc = fitz.open("NewPDF.pdf")
>>> page = doc[1]
>>> rect = fitz.Rect(0,0,880,1080)
>>> page.insertImage(rect, filename = "Image01.jpg")
error: object is not a stream
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\fitz\fitz.py", line 1225, in insertImage
return _fitz.Page_insertImage(self, rect, filename, pixmap, overlay)
RuntimeError: object is not a stream
>>> page
page 1 of NewPDF.pdf
I've tried a few different variations on this, with pixmap and without, with overlay value set, and without. The PDF file exists and can be opened with Adobe Acrobat Reader, and the image file exists - I have tried PNG and JPG.
Thank you in advance for any help.

Just some hints to try:
Make sure your "Image01.jpg" file can actually be opened, and use the full path.
from PIL import Image
image_path = "/full/path/to/Image01.jpg"
image_file = Image.open(open(image_path, 'rb'))
# side-note: generally it is better to use the open with syntax, see link below
# https://stackoverflow.com/questions/9282967/how-to-open-a-file-using-the-open-with-statement
To make sure you are actually on the PDF page you expect, try this. The code inserts the image on the first page only:
for page in doc:
    page.insertImage(rect, filename=image_path)
    break  # Without this, the image will appear on each page of your pdf

Related

Simple PyPDF exercise - AttributeError: 'NullObject' object has no attribute 'get'

Working on a simple PyPDF-related exercise: I basically need to take a PDF file and apply a watermark to it.
Here's my code:
# We need to build a program that will watermark all of our PDF files
# Use the wtr.pdf and apply it to all of the pages of our PDF file
import PyPDF2
# Open the file we want to add the watermark to
with open("combined.pdf", mode="rb") as file:
    reader = PyPDF2.PdfFileReader(file)
    # Open the watermark file and get the watermark
    with open("wtr.pdf", mode="rb") as watermark_file:
        watermark_reader = PyPDF2.PdfFileReader(watermark_file)
        # Create a writer object for the output file
        writer = PyPDF2.PdfFileWriter()
        for i in range(reader.numPages):
            page = reader.getPage(i)
            # Merge the watermark page object into our current page
            page.mergePage(watermark_reader.getPage(0))
            # Append this new page into our writer object
            writer.addPage(page)
        with open("watermarked.pdf", mode="wb") as output_file:
            writer.write(output_file)
I am unclear as to why I get this error:
$ python watermark.py
Traceback (most recent call last):
File "watermark.py", line 20, in <module>
page.mergePage(watermark_reader.getPage(0))
File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2239, in mergePage
self._mergePage(page2)
File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2260, in _mergePage
new, newrename = PageObject._mergeResources(originalResources, page2Resources, res)
File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2170, in _mergeResources
newRes.update(res1.get(resource, DictionaryObject()).getObject())
AttributeError: 'NullObject' object has no attribute 'get'
I would appreciate any insights. I have been staring at this for a while.
For some reason your PDF file doesn't contain "/Resources". PyPDF2 tries to get it at line 2314 in https://github.com/mstamy2/PyPDF2/blob/master/PyPDF2/pdf.py#L2314
You can try another PDF file to check whether the error persists. Maybe it is a bug in the library, or the library doesn't support such files.
Another thing I noticed is that the line numbers in the master branch of the library do not match the line numbers in your stack trace, so maybe you need to install a more recent version of the library and hope that the problem is fixed there.
From a brief look at the PDF file structure, it seems that /Resources is optional. If that is the case, then PyPDF2 doesn't handle it, and it should probably be reported as a bug at https://github.com/mstamy2/PyPDF2/issues

Inserting icc color profile into image using Pillow ImageCMS

I am trying to insert an ICC color profile into an image. Using code from this post as an example my code looks like this:
from PIL import Image, ImageCms
# Read image
img = Image.open(IMG_PATH)
# Read profile
profile = ImageCms.getOpenProfile(PROFILE_PATH)
# Save image with profile
img.save(OUT_IMG_PATH, icc_profile=profile)
I get the following error
Traceback (most recent call last):
File "/home/----/Documents/code_projects/hfss-misc/icc_profiles/insert_icc_profile_into_image.py", line 17, in <module>
img.save(OUT_IMG_PATH, icc_profile=srgb_profile)
File "/home/----/.virtualenvs/color-correction/lib/python3.6/site-packages/PIL/Image.py", line 2102, in save
save_handler(self, fp, filename)
File "/home/----/.virtualenvs/color-correction/lib/python3.6/site-packages/PIL/JpegImagePlugin.py", line 706, in _save
markers.append(icc_profile[:MAX_DATA_BYTES_IN_MARKER])
TypeError: 'PIL._imagingcms.CmsProfile' object is not subscriptable
I thought that there might be a problem with my ICC profile so I tried to use one generated by Pillow.
from PIL import Image, ImageCms
# Read image
img = Image.open(IMG_PATH)
# Creating sRGB profile
profile = ImageCms.createProfile("sRGB")
# Save image with profile
img.save(OUT_IMG_PATH, icc_profile=profile)
I still get the same error, however.
Does anyone know what the cause for this error is?
My system environment is as follows:
Ubuntu 18.04
Python 3.6
Pillow==7.0.0
The answer on GitHub (https://github.com/python-pillow/Pillow/issues/4464) was to use profile.tobytes():
img.save(OUT_IMG_PATH, icc_profile=profile.tobytes())
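The traceback in the question shows the JPEG writer slicing icc_profile, which is why it must be given raw bytes rather than a profile object. The following is a simplified stand-in, not Pillow's actual code: FakeProfile, split_into_chunks and the tiny CHUNK size are hypothetical names chosen only to reproduce the same kind of TypeError.

```python
CHUNK = 4  # hypothetical tiny chunk size, for illustration only


def split_into_chunks(icc_profile, size=CHUNK):
    """Slice the profile data into fixed-size pieces, as a file writer might."""
    chunks = []
    while icc_profile:
        chunks.append(icc_profile[:size])  # requires a sliceable object
        icc_profile = icc_profile[size:]
    return chunks


class FakeProfile:
    """Stand-in for a profile object that wraps its data (not a Pillow class)."""

    def __init__(self, data: bytes):
        self._data = data

    def tobytes(self) -> bytes:
        return self._data


profile = FakeProfile(b"0123456789")
print(split_into_chunks(profile.tobytes()))  # [b'0123', b'4567', b'89']

try:
    split_into_chunks(profile)  # passing the object itself instead of bytes
except TypeError as exc:
    print(exc)  # 'FakeProfile' object is not subscriptable
```

Passing bytes works because bytes support slicing; the wrapper object does not, which mirrors the error reported for the CmsProfile object.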

Can't read the content of a certain page of a pdf file available online

I've used the PyMuPDF library to parse the content of a specific page of a local PDF file, and it works. However, when I apply the same logic to a PDF file that is available online, I get an error.
The following script works (local PDF):
import fitz
path = r'C:\Users\WCS\Desktop\pymupdf\Regular Expressions Cookbook.pdf'
doc = fitz.open(path)
page1 = doc.loadPage(5)
page1text = page1.getText("text")
print(page1text)
The script below throws an error (pdf that is available online):
import fitz
import requests
URL = 'https://buildmedia.readthedocs.org/media/pdf/pdfminer-docs/latest/pdfminer-docs.pdf'
res = requests.get(URL)
doc = fitz.open(res.content)
page1 = doc.loadPage(5)
page1text = page1.getText("text")
print(page1text)
Error that the script encounters:
Traceback (most recent call last):
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\general_demo.py", line 8, in <module>
doc = fitz.open(res.content)
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\fitz\fitz.py", line 2010, in __init__
_fitz.Document_swiginit(self, _fitz.new_Document(filename, stream, filetype, rect, width, height, fontsize))
RuntimeError: cannot open b'%PDF-1.5\n%\xd0\xd4\xc5\xd8\n1 0 obj\n<<\n/Length 843 \n/Filter /FlateDecode\n>>\nstream\nx\xdamUMo\xe20\x10\xbd\xe7Wx\x0f\x95\xda\x03\xc5N\xc8W\x85\x90\x9c\x84H\x1c\xb6\xad\nZ\xed\x95&\xa6\x8bT\x12\x14\xe0\xd0\x7f\xbf~3\x13\xda\xae\xf
How can I read the content directly from online?
Looks like you need to initialize the object with stream:
>>> # from memory
>>> doc = fitz.open(stream=mem_area, filetype="pdf")
mem_area has the data of the document.
https://pymupdf.readthedocs.io/en/latest/document.html#Document
I think you were missing the read() call, which returns the file's contents as bytes that PyMuPDF can then consume.
with fitz.open(stream=uploaded_pdf.read(), filetype="pdf") as doc:
    text = ""
    for page in doc:
        text += page.getText()
    print(text)
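Put together, the stream-based approach from these answers might look like the sketch below. The helper names looks_like_pdf and open_pdf_from_bytes are hypothetical, and the header check is an extra sanity step that is not part of the original answers; only the fitz.open(stream=..., filetype="pdf") call comes from them. fitz is imported inside the function so that the header check can be tried without PyMuPDF installed.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload begins with the PDF header marker."""
    return data.startswith(b"%PDF-")


def open_pdf_from_bytes(data: bytes):
    """Open an in-memory PDF with PyMuPDF (hypothetical helper)."""
    if not looks_like_pdf(data):
        # For example, an HTML error page was downloaded instead of the PDF.
        raise ValueError("payload does not look like a PDF")
    import fitz  # PyMuPDF, imported lazily
    # Pass the bytes as `stream`; a positional argument is treated as a filename.
    return fitz.open(stream=data, filetype="pdf")


# Usage with the question's download (needs requests and PyMuPDF):
# res = requests.get(URL)
# doc = open_pdf_from_bytes(res.content)
```

The error message in the question begins with b'%PDF-1.5', so the download itself was fine; the bytes were simply passed in the filename position.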

PIL cannot identify image file for a Google Drive image streamed into io.BytesIO

I am using the Drive API to download an image. Following their file downloading documentation in Python, I end up with a variable fh that is a populated io.BytesIO instance. I try to save it as an image:
file_id = "0BwyLGoHzn5uIOHVycFZpSEwycnViUjFYQXR5Nnp6QjBrLXJR"
request = service.files().get_media(fileId=file_id)
fh = io.BytesIO()
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print('Download {} {}%.'.format(file['name'],
                                    int(status.progress() * 100)))
fh.seek(0)
image = Image.open(fh) # error
The error is: cannot identify image file <_io.BytesIO object at 0x106cba890>. Actually, the error does not occur with another image but is thrown with most images, including the one I linked at the beginning of this post.
After reading this answer I change that last line to:
byteImg = fh.read()
dataBytesIO = io.BytesIO(byteImg)
image = Image.open(dataBytesIO) # still the same error
I've also tried this answer, where I change the last line of my first code block to
byteImg = fh.read()
image = Image.open(StringIO(byteImg))
But I still get a cannot identify image file <StringIO.StringIO instance at 0x106471e60> error.
I've tried using alternatives (requests, urllib) to no avail. I can Image.open the image if I download it manually.
This error was not present a month ago, and has recently popped up into the application this code is in. I've spent days debugging this error with no success and have finally brought the issue to Stack Overflow. I am using from PIL import Image.
Ditch the Drive service's MediaIOBaseDownload. Instead, use the webContentLink property of a media file (a link for downloading the content of the file in a browser, only available for files with binary content). Read more here.
With that content link, we can use an alternate form of streaming, via the requests and shutil libraries, to get the image.
import requests
import shutil
r = requests.get(file['webContentLink'], stream=True)
with open('output_file', 'wb') as f:
    shutil.copyfileobj(r.raw, f)
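Whichever download method is used, one quick diagnostic for "cannot identify image file" is to check whether the buffer really starts with an image signature or with something else, such as an HTML page. This check is not part of the original answer; sniff_image_type and the sample payloads below are hypothetical, and only a few common formats are covered.

```python
import io

# Leading bytes ("magic numbers") of a few common image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(buffer: io.BytesIO):
    """Return a format name if the buffer starts with a known signature, else None."""
    head = buffer.getvalue()[:16]
    for signature, name in SIGNATURES.items():
        if head.startswith(signature):
            return name
    return None


# Hypothetical payloads standing in for the downloaded `fh` buffer.
good = io.BytesIO(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
bad = io.BytesIO(b"<!DOCTYPE html><html>...")
print(sniff_image_type(good))  # png
print(sniff_image_type(bad))   # None
```

If the real buffer returns None here, printing its first few dozen bytes usually shows what was actually downloaded.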

Error setting psm for pytesseract

I'm trying to use a psm of 0 with pytesseract, but I'm getting an error. My code is:
import pytesseract
from PIL import Image
img = Image.open('pathToImage')
pytesseract.image_to_string(img, config='-psm 0')
The error that comes up is
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pytesseract/pytesseract.py", line 126, in image_to_string
f = open(output_file_name, 'rb')
IOError: [Errno 2] No such file or directory:
'/var/folders/m8/pkg0ppx11m19hwn71cft06jw0000gp/T/tess_uIaw2D.txt'
When I go into '/var/folders/m8/pkg0ppx11m19hwn71cft06jw0000gp/T', there's a file called tess_uIaw2D.osd that seems to contain the output information I was looking for. It seems like tesseract is saving a file as .osd, then looking for that file but with a .txt extension. When I run tesseract through the command line with --psm 0, it saves the output file as .osd instead of .txt.
Is it correct that pytesseract's image_to_string() works by saving an output file somewhere and then automatically reading that output file? And is there any way to either set tesseract to save the file as .txt, or to set it to look for a .osd file? I'm having no issues just running the image_to_string() function when I don't set the psm.
You have a couple of questions here:
PSM error
In your question you mention that you run "--psm 0" on the command line. However, in your code snippet you have "-psm 0".
Using the double dash, config="--psm 0", will fix that issue.
If you read the tesseract command-line documentation, you will see that you can specify where to output the text read from the image. I suggest you start there.
Is it correct that pytesseract's image_to_string() works by saving an output file somewhere and then automatically reading that output file?
From my usage of pytesseract, you don't need to handle that file yourself.
pytesseract.image_to_string() returns the string found in the image by default (your traceback shows it reading tesseract's temporary output file internally to do so). This is controlled by the parameter output_type=Output.STRING in the signature of image_to_string.
The other return options include (1) Output.BYTES and (2) Output.DICT
I usually have something like text = pytesseract.image_to_string(img)
I then write that text to a log file
Here is an example:
import datetime
import io
import pytesseract
import cv2
img = cv2.imread("pathToImage")
text = pytesseract.image_to_string(img, config="--psm 0")
ocr_log = "C:/foo/bar/output.txt"
timestamp_fmt = "%Y-%m-%d_%H-%M-%S-%f"
# ...
# DO SOME OTHER STUFF BEFORE WRITING TO LOG FILE
# ...
with io.open(ocr_log, "a") as ocr_file:
    timestamp = datetime.datetime.now().strftime(timestamp_fmt)
    ocr_file.write(f"{timestamp}:\n====OCR-START===\n")
    ocr_file.write(text)
    ocr_file.write("\n====OCR-END====\n")
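A side note on page segmentation mode 0: it performs orientation and script detection (OSD) only, which is why the question found an .osd file rather than a .txt file. pytesseract also provides image_to_osd for this mode. The sketch below assumes the OSD report is a series of "Key: value" lines; parse_osd is a hypothetical helper and the sample text is illustrative, not real output.

```python
def parse_osd(report: str) -> dict:
    """Turn the 'Key: value' lines of an OSD report into a dictionary."""
    result = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            result[key.strip()] = value.strip()
    return result


# Illustrative sample only; real values depend on the image.
sample = "Page number: 0\nOrientation in degrees: 90\nRotate: 270\nScript: Latin\n"
info = parse_osd(sample)
print(info["Rotate"])  # 270

# With pytesseract and tesseract installed, the report could come from:
# report = pytesseract.image_to_osd(Image.open("pathToImage"))
```

The values are kept as strings; convert them with int() or float() as needed.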
