Check file before calling SimpleITK.SimpleITK.ImageFileReader.ReadImageInformation() - python

I am processing a set of DICOM files, some of which have image information and some of which don't. If a file has image information, the following code works fine.
file_reader = sitk.ImageFileReader()
file_reader.SetFileName(fileName)
file_reader.ReadImageInformation()
However, if the file does not have image information, I get the following error.
Traceback (most recent call last):
File "<ipython-input-61-d187aed107ed>", line 5, in <module>
file_reader.ReadImageInformation()
File "/home/peter/anaconda3/lib/python3.7/site-packages/SimpleITK/SimpleITK.py", line 8673, in ReadImageInformation
return _SimpleITK.ImageFileReader_ReadImageInformation(self)
RuntimeError: Exception thrown in SimpleITK ImageFileReader_ReadImageInformation: /tmp/SimpleITK/Code/IO/src/sitkImageReaderBase.cxx:107:
sitk::ERROR: Unable to determine ImageIO reader for "/path/115.dcm"
If the DICOM file has no information, I would like to just ignore the file rather than calling ReadImageInformation(). Is there a way to check whether ReadImageInformation() will work before it is called? I tried the following and they are no different between files where ReadImageInformation() and files where it does not.
file_reader.GetImageIO()
file_reader.GetMetaDataKeys() # Crashes
file_reader.GetDimension()

I would just put an exception handler around it to catch the error. So it'd look something like this:
file_reader = sitk.ImageFileReader()
file_reader.SetFileName(fileName)
try:
file_reader.ReadImageInformation()
except:
print(fileName, "has no image information")

Related

Simple PyPDF exercise - AttributeError: 'NullObject' object has no attribute 'get'

Working on a simple PyPDF related exercise - I basically need to take a PDF file and apply a watermark to to it.
Here's my code:
# We need to build a program that will watermark all of our PDF files
# Use the wtr.pdf and apply it to all of the pages of our PDF file
import PyPDF2
# Open the file we want to add the watermark to
with open("combined.pdf", mode="rb") as file:
reader = PyPDF2.PdfFileReader(file)
# Open the watermark file and get the watermark
with open("wtr.pdf", mode="rb") as watermark_file:
watermark_reader = PyPDF2.PdfFileReader(watermark_file)
# Create a writer object for the output file
writer = PyPDF2.PdfFileWriter()
for i in range(reader.numPages):
page = reader.getPage(i)
# Merge the watermark page object into our current page
page.mergePage(watermark_reader.getPage(0))
# Append this new page into our writer object
writer.addPage(page)
with open("watermarked.pdf", mode="wb") as output_file:
writer.write(output_file)
I am unclear as to why I get this error:
$ python watermark.py
Traceback (most recent call last):
File "watermark.py", line 20, in <module>
page.mergePage(watermark_reader.getPage(0))
File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2239, in mergePage
self._mergePage(page2)
File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2260, in _mergePage
new, newrename = PageObject._mergeResources(originalResources, page2Resources, res)
File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2170, in _mergeResources
newRes.update(res1.get(resource, DictionaryObject()).getObject())
AttributeError: 'NullObject' object has no attribute 'get'
I would appreciate any insights. I have been staring at this for a while.
For some reason your pdf file doesn't contain "/Resources". PyPDF2 tries to get it in line 2314 in https://github.com/mstamy2/PyPDF2/blob/master/PyPDF2/pdf.py#L2314
You can try another pdf file to check if the error persists. May be it is a bug in the library or the library doesn't support such files.
Another thing I noticed is that line numbers in master branch of the library do not match line numbers in your stack trace, so may be you need to get more recent version of the library and hope that the problem is fixed there.
By briefly looking at pdf file structure it seems that /Resources are optional. If this is a case, then PyPDF2 doesn't handle this case and it should be probably reported as a bug at https://github.com/mstamy2/PyPDF2/issues

python-pptx: Dealing with password-protected PowerPoint files

I'm using a slightly modified version of the "Extract all text from slides in presentation" example at https://python-pptx.readthedocs.io/en/latest/user/quickstart.html to extract text from some PowerPoint slides.
I'm getting a PackageNotFoundError when I try to use the Presentation() method to open some of the PowerPoint files to read the text.
This appears to be due to the fact that, unbeknownst to me before I started this project, a few of the PowerPoint files are password protected.
I obviously don't expect to be able to read text from a password-protected file but is there a recommended way of dealing with password-protected PowerPoint files? Having my Python script die every time it runs into one is annoying.
I'd be fine with something that basically went: "Hi! The file you're trying to read may be password-protected. Skipping."
I tried using a try/except block to catch the PackageNotFoundError but then I got "NameError: name 'PackageNotFoundError' is not defined".
EDIT1: Here's a minimal case the generates the error:
EDIT2: See below for a working try/catch block, thanks to TheGamer007's suggestion.
import pptx
from pptx import Presentation
password_protected_file = r"C:\Users\J69401\Documents\password_protected_file.pptx"
prs = Presentation(password_protected_file)
And here's the error that is generated:
Traceback (most recent call last):
File "T:/W/Wintermute/50 Sandbox/Pownall/Python/copy files/minimal_case_opening_file.py", line 6, in <module>
prs = Presentation(password_protected_file)
File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\api.py", line 28, in Presentation
presentation_part = Package.open(pptx).main_document_part
File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\opc\package.py", line 125, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\opc\pkgreader.py", line 33, in from_file
phys_reader = PhysPkgReader(pkg_file)
File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\opc\phys_pkg.py", line 32, in __new__
raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
pptx.exc.PackageNotFoundError: Package not found at 'C:\Users\J69401\Documents\password_protected_file.pptx'
Here's the minimal case again but with a working try/catch block.
import pptx
from pptx import Presentation
import pptx.exc
from pptx.exc import PackageNotFoundError
password_protected_file = r"C:\Users\J69401\Documents\password_protected_file.pptx"
try:
prs = Presentation(password_protected_file)
except PackageNotFoundError:
print("PackageNotFoundError generated - possible password-protected file.")

Is there a method to write metadata to an image in PyExifTool?

I used ExifTool's '-tagsFromFile' to copy Exif metadata from a source image to a destination image from the command line. Wanted to do the same in a python script, which is when I got to know of PyExifTool. However, I didn't find any command to copy or write to a destination image. Am I missing something? Is there a way I can fix this?
I found user5008949's answer to a similar question which suggested doing this:
import exiftool
filename = '/home/radha/src.JPG'
with exiftool.ExifTool() as et:
et.execute("-tagsFromFile", filename , "dst.JPG")
However, it gives me the following error:
Traceback (most recent call last):
File "metadata.py", line 9, in <module>
et.execute("-tagsFromFile", filename , "dst.JPG")
File "/home/radha/venv/lib/python3.6/site-packages/exiftool.py", line 221, in execute
self._process.stdin.write(b"\n".join(params + (b"-execute\n",)))
TypeError: sequence item 0: expected a bytes-like object, str found
execute() method requires bytes as inputs and you are passing strings. That is why it fails.
In your case, the code should look like this:
import exiftool
filename = b"/home/radha/src.JPG"
with exiftool.ExifTool() as et:
et.execute(b"-tagsFromFile", filename , b"dst.JPG")
Please find this answer as a refernece.

Error setting psm for pytesseract

I'm trying to use a psm of 0 with pytesseract, but I'm getting an error. My code is:
import pytesseract
from PIL import Image
img = Image.open('pathToImage')
pytesseract.image_to_string(img, config='-psm 0')
The error that comes up is
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pytesseract/pytesseract.py", line 126, in image_to_string
f = open(output_file_name, 'rb')
IOError: [Errno 2] No such file or directory:
'/var/folders/m8/pkg0ppx11m19hwn71cft06jw0000gp/T/tess_uIaw2D.txt'
When I go into '/var/folders/m8/pkg0ppx11m19hwn71cft06jw0000gp/T', there's a file called tess_uIaw2D.osd that seems to contain the output information I was looking for. It seems like tesseract is saving a file as .osd, then looking for that file but with a .txt extension. When I run tesseract through the command line with --psm 0, it saves the output file as .osd instead of .txt.
Is it correct that pytesseract's image_to_string() works by saving an output file somewhere and then automatically reading that output file? And is there any way to either set tesseract to save the file as .txt, or to set it to look for a .osd file? I'm having no issues just running the image_to_string() function when I don't set the psm.
You have a couple of questions here:
PSM error
In your question you mention that you are running "--psm 0" in the command line. However in your code snip you have "-psm 0".
Using the double dash, config= "--psm 0", will fix that issue.
If you read the tesseract command line documentation, you can specify where to output the text read from the image. I suggest you start there.
Is it correct that pytesseract's image_to_string() works by saving an output file somewhere and then automatically reading that output file?
From my usage of tesseract, this is not how it works
pytesseract.image_to_string() by default returns the string found on the image. This is defined by the parameter output_type=Output.STRING, when you look at the function image_to_string.
The other return options include (1) Output.BYTES and (2) Output.DICT
I usually have something like text = pytesseract.image_to_string(img)
I then write that text to a log file
Here is an example:
import datetime
import io
import pytesseract
import cv2
img = cv2.imread("pathToImage")
text = pytesseract.image_to_string(img, config="--psm 0")
ocr_log = "C:/foo/bar/output.txt"
timestamp_fmt = "%Y-%m-%d_%H-%M-%S-%f"
# ...
# DO SOME OTHER STUFF BEFORE WRITING TO LOG FILE
# ...
with io.open(ocr_log, "a") as ocr_file:
timestamp = datetime.datetime.now().strftime(timestamp_fmt)
ocr_file.write(f"{timestamp}:\n====OCR-START===\n")
ocr_file.write(text)
ocr_file.write("\n====OCR-END====\n")

How to overcome memory issue when sequentially appending files to one another

I am running the following script in order to append files to one another by cycling through months and years if the file exists, I have just tested it with a larger dataset where I would expect the output file to be roughly 600mb in size. However I am running into memory issues. Firstly is this normal to run into memory issues (my pc has 8 gb ram) I am not sure how I am eating all of this memory space?
Code I am running
import datetime, os
import StringIO
stored_data = StringIO.StringIO()
start_year = "2011"
start_month = "November"
first_run = False
current_month = datetime.date.today().replace(day=1)
possible_month = datetime.datetime.strptime('%s %s' % (start_month, start_year), '%B %Y').date()
while possible_month <= current_month:
csv_filename = possible_month.strftime('%B %Y') + ' MRG.csv'
if os.path.exists(csv_filename):
with open(csv_filename, 'rb') as current_csv:
if first_run != False:
next(current_csv)
else:
first_run = True
stored_data.writelines(current_csv)
possible_month = (possible_month + datetime.timedelta(days=31)).replace(day=1)
if stored_data:
contents = stored_data.getvalue()
with open('FullMergedData.csv', 'wb') as output_csv:
output_csv.write(contents)
The trackback I receive:
Traceback (most recent call last):
File "C:\code snippets\FullMerger.py", line 23, in <module>
contents = stored_output.getvalue()
File "C:\Python27\lib\StringIO.py", line 271, in getvalue
self.buf += ''.join(self.buflist)
MemoryError
Any ideas how to achieve a work around or make this code more efficient to overcome this issue. Many thanks
AEA
Edit1
Upon running the code supplied alKid I received the following traceback.
Traceback (most recent call last):
File "C:\FullMerger.py", line 22, in <module>
output_csv.writeline(line)
AttributeError: 'file' object has no attribute 'writeline'
I fixed the above by changing it to writelines however I still received the following trace back.
Traceback (most recent call last):
File "C:\FullMerger.py", line 19, in <module>
next(current_csv)
StopIteration
In stored_data, you're trying to store the whole file, and since it's too large, you're getting the error you are showing.
One solution is to write the file per line. It is far more memory-efficient, since you only store a line of data in the buffer, instead of the whole 600 MB.
In short, the structure can be something this:
with open('FullMergedData.csv', 'a') as output_csv: #this will append
# the result to the file.
with open(csv_filename, 'rb') as current_csv:
for line in current_csv: #loop through the lines
if first_run != False:
next(current_csv)
first_run = True #After the first line,
#you should immidiately change first_run to true.
output_csv.writelines(line) #write it per line
Should fix your problem. Hope this helps!
Your memory error is because you store all the data in a buffer before writing it. Consider using something like copyfileobj to directly copy from one open file object to another, this will only buffer small amounts of data at a time. You could also do it line by line, which will have much the same effect.
Update
Using copyfileobj should be much faster than writing the file line by line. Here is an example of how to use copyfileobj. This code opens two files, skips the first line of the input file if skip_first_line is True and then copies the rest of that file to the output file.
skip_first_line = True
with open('FullMergedData.csv', 'a') as output_csv:
with open(csv_filename, 'rb') as current_csv:
if skip_first_line:
current_csv.readline()
shutil.copyfileobj(current_csv, output_csv)
Notice that if you're using copyfileobj you'll want to use current_csv.readline() instead of next(current_csv). That's because iterating over a file object buffers part of the file, which is normally very useful, but you don't want that in this case. More on that here.

Categories

Resources