PdfFileWriter fails on content that includes Chinese characters - python

I was writing a script to combine a set of small PDF files into a single PDF when it failed with a UnicodeEncodeError.
I also tried passing an encoding parameter:
with open("Combined.pdf", "w", encoding='utf-8-sig') as outputStream:
but Python reported that the file needs to be opened in binary 'wb' mode, so that does not work.
Below is the code:
writer = PdfFileWriter()
input_stream = []
for f2 in f_re:
    inputf_file = str(mypath+'\\'+f2[2])
    input_stream.append(open(inputf_file,'rb'))
for reader in map(PdfFileReader, input_stream):
    for n in range(reader.getNumPages()):
        writer.addPage(reader.getPage(n))
with open("Combined.pdf", "wb") as outputStream:
    writer.write(outputStream)
    writer.save()
for f in input_stream:
    f.close()
Below is the error message:
Traceback (most recent call last):
File "\Workspace\Python\py_CombinPDF\py_combinePDF.py", line 89, in <module>
writer.write(outputStream)
File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\pdf.py", line 501, in write
obj.writeToStream(stream, key)
File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\generic.py", line 549, in writeToStream
value.writeToStream(stream, encryption_key)
File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\generic.py", line 472, in writeToStream
stream.write(b_(self))
File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\utils.py", line 238, in b_
r = s.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 8-9: ordinal not in range(256)

Upgrading PyPDF2 solved this issue.
Now, 4 years later, people should use pypdf, which contains the latest code (I'm the maintainer of both PyPDF2 and pypdf).
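For context on why the old version failed: the last frames of the traceback show PyPDF2's b_ helper calling s.encode('latin-1'), and latin-1 only covers code points below 256. Here is a minimal sketch of that failure, independent of PyPDF2. The helper name encode_like_b_ and the sample strings are made up for illustration.

```python
def encode_like_b_(text):
    # Mirrors the line from the traceback: r = s.encode('latin-1')
    return text.encode("latin-1")


# Plain ASCII text is representable in latin-1.
ascii_ok = encode_like_b_("Combined")

# Chinese characters have code points far above 255, so encoding fails.
try:
    encode_like_b_("\u5408\u5e76 report")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("encode failed:", exc)
```

Any string stored in the PDF with characters outside latin-1 would hit the same exception on that code path, which matches the reported behaviour.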

Related

How to extract text data from a multi page CV in a PDF format using pyPDF2?

I extracted the text content from a multi-page CV in PDF format using PyPDF2 and am trying to write that content to a text file, but I get the following error message when writing it.
Here is my code:
import PyPDF2
newFile = open('details.txt', 'w')
file = open("cv3.pdf", 'rb')
pdfreader = PyPDF2.PdfFileReader(file)
numPages = pdfreader.getNumPages()
print(numPages)
page_content = ""
for page_number in range(numPages):
    page = pdfreader.getPage(page_number)
    page_content += page.extractText()
newFile.write(page_content)
print(page_content)
file.close()
newFile.close()
The error message:
Traceback (most recent call last):
File "C:/Users/HP/PycharmProjects/CVParser/pdf.py", line 16, in <module>
newFile.write(page_content)
File "C:\Program Files\Python37\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0141' in position 827: character maps to <undefined>
Process finished with exit code 1
This code worked with another multi-page PDF file (a docx file that had been converted to PDF).
Please help if anyone knows the solution.
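The output file is opened with the platform default encoding (cp1252 here, according to the traceback), which cannot represent '\u0141'. In Python 3, open the file with an explicit encoding:
with open("Output.txt", "w", encoding="utf-8") as text_file:
    print("{}".format(page_content), file=text_file)
If that does not work for you for some reason, then try writing encoded bytes instead:
with open("Output1.txt", "wb") as text_file:
    text_file.write(page_content.encode("UTF-8"))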
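To see the difference between the two encodings without a PDF at hand, here is a small sketch. It assumes the extracted text contains '\u0141' (the character named in the traceback); the sample string and the temporary file are stand-ins for the real page_content and details.txt.

```python
import os
import tempfile

# Stand-in for the text returned by extractText(); contains U+0141.
page_content = "Name: \u0141ukasz"

# cp1252 has no mapping for U+0141, which is what the traceback reports.
try:
    page_content.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit UTF-8 encoding succeeds and round-trips.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "details.txt")
    with open(path, "w", encoding="utf-8") as text_file:
        text_file.write(page_content)
    with open(path, encoding="utf-8") as text_file:
        round_trip = text_file.read()

print(cp1252_ok, round_trip == page_content)
```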

How to convert binary file into readable format on linux server

I am trying to convert a binary file into a readable format but have been unable to do so. Please suggest how this could be achieved.
$ file test.docx
test.docx: Microsoft Word 2007+
$ file -i test.docx
test.docx: application/msword; charset=binary
$
>>> raw = codecs.open('test.docx', encoding='ascii').readlines()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/Python/installPath/lib/python2.7/codecs.py", line 694, in readlines
return self.reader.readlines(sizehint)
File "/home/Python/installPath/lib/python2.7/codecs.py", line 603, in readlines
data = self.read()
File "/home/Python/installPath/lib/python2.7/codecs.py", line 492, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 18: ordinal not in range(128)
Try the code below, which works with binary data:
with open("test_file.docx", "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
    print(data)
    # Seek position and read N bytes
    binary_file.seek(0) # Go to beginning
    couple_bytes = binary_file.read(2)
    print(couple_bytes)
You'll have to read it in binary mode:
import binascii
with open('test.docx', 'rb') as f: # 'rb' stands for read binary
    hexdata = binascii.hexlify(f.read()) # convert to hex
print(hexdata)
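Reading the raw bytes also lets you check what kind of file you are dealing with. Word 2007+ files are ZIP containers, and a ZIP archive whose first entry is a regular member begins with the bytes PK\x03\x04. A rough sketch of that check follows; looks_like_zip is a hypothetical helper, and the sample files are created on the fly rather than being a real test.docx.

```python
import os
import tempfile
import zipfile

ZIP_SIGNATURE = b"PK\x03\x04"  # local file header of the first archive member


def looks_like_zip(path):
    """Return True if the file starts with the ZIP local-header signature."""
    with open(path, "rb") as binary_file:
        return binary_file.read(4) == ZIP_SIGNATURE


with tempfile.TemporaryDirectory() as tmp:
    # A tiny ZIP archive standing in for a .docx file.
    zip_path = os.path.join(tmp, "sample.docx")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("word/document.xml", "<doc/>")

    # A plain text file for comparison.
    txt_path = os.path.join(tmp, "notes.txt")
    with open(txt_path, "wb") as fh:
        fh.write(b"just text\n")

    zip_detected = looks_like_zip(zip_path)
    txt_detected = looks_like_zip(txt_path)

print(zip_detected, txt_detected)
```

The standard library's zipfile.is_zipfile() does a more thorough check and is probably the better choice outside of a quick experiment.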
I think the other answers have not addressed this question, at least not the part that @ankitpandey clarified in his comment about catdoc returning an error:
"catdoc then error is This file looks like ZIP archive or Office 2007 or later file. Not supported by catdoc"
I had just run into the same issue with catdoc and found a solution that worked for me.
The mention of a ZIP archive was the clue, and I was able to use the following command
unzip -q -c 'test.docx' word/document.xml | python etree.py
to extract the text portion of test.docx to stdout.
The Python code was placed in etree.py:
from lxml import etree
import sys
xml = sys.stdin.read().encode('utf-8')
root = etree.fromstring(xml)
bits_of_text = root.xpath('//text()')
# print(bits_of_text) # Note that some bits are whitespace-only
joined_text = ' '.join(
    bit.strip() for bit in bits_of_text
    if bit.strip() != ''
)
print(joined_text)
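If unzip or lxml is not available on the server, roughly the same extraction can be sketched with the standard library only. This is an alternative under assumptions, not the answer's original method: zipfile replaces the unzip call, xml.etree.ElementTree's itertext() replaces the //text() XPath, and make_sample_docx is a hypothetical helper that builds a tiny stand-in archive so the example runs without a real test.docx.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(source):
    """Join the non-blank text nodes of word/document.xml with spaces."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    bits = (bit.strip() for bit in root.itertext())
    return " ".join(bit for bit in bits if bit)


def make_sample_docx():
    """Hypothetical stand-in for a real .docx: a ZIP with one XML member."""
    body = (
        '<w:document xmlns:w="{ns}"><w:body><w:p>'
        "<w:r><w:t>Hello</w:t></w:r><w:r><w:t> world </w:t></w:r>"
        "</w:p></w:body></w:document>"
    ).format(ns=W_NS)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


extracted = docx_text(make_sample_docx())
print(extracted)
```

With a real file, docx_text('test.docx') should work the same way, since zipfile.ZipFile accepts either a path or a file object.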

Read compressed stdin

I would like to make a call like this:
pv -ptebar compressed.csv.gz | python my_script.py
Inside my_script.py I would like to decompress compressed.csv.gz and parse it using Python csv parser. I would expect something like this:
import csv
import gzip
import sys
with gzip.open(fileobj=sys.stdin, mode='rt') as f:
    reader = csv.reader(f)
    print(next(reader))
    print(next(reader))
    print(next(reader))
Of course it doesn't work because gzip.open doesn't have fileobj argument. Could you provide some working example solving this issue?
UPDATE
Traceback (most recent call last):
File "my_script.py", line 8, in <module>
print(next(reader))
File "/usr/lib/python3.5/gzip.py", line 287, in read1
return self._buffer.read1(size)
File "/usr/lib/python3.5/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/usr/lib/python3.5/gzip.py", line 461, in read
if not self._read_gzip_header():
File "/usr/lib/python3.5/gzip.py", line 404, in _read_gzip_header
magic = self._fp.read(2)
File "/usr/lib/python3.5/gzip.py", line 91, in read
self.file.read(size-self._length+read)
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
The traceback above appeared after applying @Rawing's advice.
In Python 3.3+, you can pass a file object to gzip.open:
The filename argument can be an actual filename (a str or bytes object), or an existing file object to read from or write to.
So your code should work if you drop the fileobj= keyword and pass the binary stream underlying sys.stdin. Passing sys.stdin itself fails with the UnicodeDecodeError shown in the update, because it is a text-mode stream that tries to decode the compressed bytes as UTF-8:
with gzip.open(sys.stdin.buffer, mode='rt') as f:
If you want the decompressed bytes rather than decoded text, open it in binary mode instead:
with gzip.open(sys.stdin.buffer, mode='rb') as f:
If for some odd reason you're using a Python older than 3.3, you can invoke the gzip.GzipFile constructor directly. GzipFile only works with binary data, so again use sys.stdin's underlying buffer:
with gzip.GzipFile(fileobj=sys.stdin.buffer) as f:
Using gzip.open(sys.stdin.buffer, 'rt') fixes the issue for Python 3.
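Putting the pieces together, here is a sketch of the script body for Python 3. The function name first_rows and the in-memory sample data are assumptions for illustration; in my_script.py you would pass sys.stdin.buffer instead of the BytesIO object.

```python
import csv
import gzip
import io
import itertools


def first_rows(binary_stream, limit=3):
    """Decompress a gzip byte stream and return its first CSV rows."""
    # 'rt' makes gzip.open wrap the decompressed bytes in a text layer;
    # newline='' is what the csv module expects from its input.
    with gzip.open(binary_stream, mode="rt", encoding="utf-8", newline="") as text:
        return list(itertools.islice(csv.reader(text), limit))


# Stand-in for the piped compressed.csv.gz: gzip-compressed CSV bytes.
sample = io.BytesIO(gzip.compress(b"id,name\n1,Ada\n2,Linus\n3,Grace\n"))
rows = first_rows(sample)
print(rows)
```

The UTF-8 encoding is a guess about the CSV content; adjust it if the file uses something else.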

I get a Python framework error while reading a CSV file, but a different, simpler file works fine

import csv
exampleFile = open('example.csv')
exampleReader = csv.reader(exampleFile)
for row in exampleReader:
    print('Row #' + str(exampleReader.line_num) + ' ' + str(row))
Traceback (most recent call last):
File "/Users/jossan113/Documents/Python II/test.py", line 7, in <module>
for row in exampleReader:
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 4627: ordinal not in range(128)
Does anyone have any idea why I get this error? I tried a very simple CSV file from the internet and it worked just fine, but the bigger file fails.
The file contains non-ASCII characters, which were painful to deal with in old versions of Python. Since you are using 3.5, try opening the file as UTF-8 and see if the issue goes away:
exampleFile = open('example.csv', encoding="utf-8")
From the docs:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
csv module docs
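A quick way to convince yourself that the encoding argument is what matters is shown below. The sketch assumes the file really is UTF-8, which the error message alone does not prove. read_rows is a made-up helper, and the sample file is written to a temporary directory.

```python
import csv
import os
import tempfile


def read_rows(path, encoding):
    """Read every row of a CSV file using the given text encoding."""
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    # Write a UTF-8 file that contains one non-ASCII character.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([["city", "country"], ["Malm\u00f6", "Sweden"]])

    # Decoding as ASCII fails on the first byte above 0x7F.
    try:
        read_rows(path, "ascii")
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    rows = read_rows(path, "utf-8")

print(ascii_failed, rows)
```

If UTF-8 also raises a UnicodeDecodeError on your file, it was probably saved in a different encoding, and you would need to find out which one.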

UnicodeDecodeError message in python

I am relatively new to Python programming. I am using Python 3.3.2 on Windows XP.
My program was working and then all of a sudden I got a UnicodeDecodeError error message.
The exec.py file looks like this:
import re
import os,shutil
f=open("C:/Documents and Settings/hp/Desktop/my_python_files/AU20-10297-2_yield_69p4_11fails_2_10_14python/a1.txt","a")
for r,d,fi in os.walk("C:/Documents and Settings/hp/Desktop/my_python_files/AU20-10297-2_yield_69p4_11fails_2_10_14python"):
    for files in fi:
        if files.startswith("data"):
            g=open(os.path.join(r,files))
            shutil.copyfileobj(g,f)
            g.close()
f.close()
keywords = ['FAIL']
pattern = re.compile('|'.join(keywords))
inFile = open("a1.txt")
outFile =open("failure_info", "w")
keepCurrentSet = False
for line in inFile:
    if line.startswith(" Test Results"):
        keepCurrentSet = False
    if keepCurrentSet:
        outFile.write(line)
    if line.startswith("Station ID "):
        keepCurrentSet = True
    #if 'FAIL' in line in inFile:
    #    outFile.write(line)
    if pattern.search(line):
        outFile.write(line)
inFile.close()
outFile.close()
Now, a1.txt is initially an empty seed text file used for collecting data from the data files.
I got the following error messages:
Traceback (most recent call last):
File "C:\Documents and Settings\hp\Desktop\my_python_files\AU20-10297-2_yield_69p4_11fails_2_10_14python\exec.py", line 8, in <module>
shutil.copyfileobj(g,f)
File "C:\Python33\lib\shutil.py", line 68, in copyfileobj
buf = fsrc.read(length)
File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 754: character maps to <undefined>
Can anyone help me fix the code so it is more robust?
You have opened the files in text mode, which means Python will try to decode the contents to Unicode. You'd normally need to specify the correct codec for the file (or Python will use your platform default), but here you are just copying files across with shutil.copyfileobj(), so decoding is not needed.
Open the files in binary mode instead.
f = open(..., 'ab')
for r,d,fi in os.walk(...):
    for files in fi:
        if files.startswith("data"):
            g = open(os.path.join(r, files), 'rb')
            shutil.copyfileobj(g,f)
Note the addition of the b to the filemode.
You probably want to use the file objects as context managers so they are closed for you, automatically:
with open(..., 'ab') as outfh:
    for r,d,fi in os.walk(...):
        for files in fi:
            if files.startswith("data"):
                with open(os.path.join(r, files), 'rb') as infh:
                    shutil.copyfileobj(infh, outfh)
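For a version you can run as-is, here is a sketch of the same binary-mode copy wrapped in a function. The name concat_data_files, the sorted() call (added only to make the order predictable) and the temporary sample files are my additions, not part of the original script.

```python
import os
import shutil
import tempfile


def concat_data_files(root, dest_path, prefix="data"):
    """Append every file under root whose name starts with prefix to dest_path."""
    with open(dest_path, "ab") as outfh:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                if name.startswith(prefix):
                    with open(os.path.join(dirpath, name), "rb") as infh:
                        shutil.copyfileobj(infh, outfh)


with tempfile.TemporaryDirectory() as tmp:
    # Sample inputs; 0x8d is the byte that cp1252 could not decode.
    with open(os.path.join(tmp, "data1.log"), "wb") as fh:
        fh.write(b"ok\x8d\n")
    with open(os.path.join(tmp, "data2.log"), "wb") as fh:
        fh.write(b"FAIL\n")

    dest = os.path.join(tmp, "a1.txt")
    concat_data_files(tmp, dest)
    with open(dest, "rb") as fh:
        combined = fh.read()

print(combined)
```

Because nothing is decoded during the copy, the 0x8d byte passes through untouched. The later text-mode read of a1.txt would still need a suitable encoding or error handler.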
