UnicodeDecodeError message in python

I am relatively new to Python programming. I am using Python 3.3.2 on Windows XP.
My program was working and then all of a sudden I got a UnicodeDecodeError error message.
The exec.py file looks like this:
import re
import os,shutil
f=open("C:/Documents and Settings/hp/Desktop/my_python_files/AU20-10297-2_yield_69p4_11fails_2_10_14python/a1.txt","a")
for r,d,fi in os.walk("C:/Documents and Settings/hp/Desktop/my_python_files/AU20-10297-2_yield_69p4_11fails_2_10_14python"):
    for files in fi:
        if files.startswith("data"):
            g=open(os.path.join(r,files))
            shutil.copyfileobj(g,f)
            g.close()
f.close()
keywords = ['FAIL']
pattern = re.compile('|'.join(keywords))
inFile = open("a1.txt")
outFile =open("failure_info", "w")
keepCurrentSet = False
for line in inFile:
    if line.startswith(" Test Results"):
        keepCurrentSet = False
    if keepCurrentSet:
        outFile.write(line)
    if line.startswith("Station ID "):
        keepCurrentSet = True
    #if 'FAIL' in line in inFile:
    #    outFile.write(line)
    if pattern.search(line):
        outFile.write(line)
inFile.close()
outFile.close()
Now, a1.txt is initially an empty seed text file used for collecting data from the data files.
I got the following error messages:
Traceback (most recent call last):
  File "C:\Documents and Settings\hp\Desktop\my_python_files\AU20-10297-2_yield_69p4_11fails_2_10_14python\exec.py", line 8, in <module>
    shutil.copyfileobj(g,f)
  File "C:\Python33\lib\shutil.py", line 68, in copyfileobj
    buf = fsrc.read(length)
  File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 754: character maps to <undefined>
Can anyone help me fix the code so it is more robust?

You have opened the files in text mode, which means Python will try to decode the contents to Unicode. You'd normally need to specify the correct codec for the file (otherwise Python uses your platform default, here cp1252), but you are just copying files across with shutil.copyfileobj() here, so no decoding is needed.
Open the files in binary mode instead.
f = open(..., 'ab')
for r,d,fi in os.walk(...):
    for files in fi:
        if files.startswith("data"):
            g = open(os.path.join(r, files), 'rb')
            shutil.copyfileobj(g,f)
Note the addition of the b to the filemode.
You probably want to use the file objects as context managers so they are closed for you, automatically:
with open(..., 'ab') as outfh:
    for r,d,fi in os.walk(...):
        for files in fi:
            if files.startswith("data"):
                with open(os.path.join(r, files), 'rb') as infh:
                    shutil.copyfileobj(infh, outfh)
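As a self-contained illustration of the binary-mode approach, here is a minimal sketch. The function name concat_data_files, the file names and the sample bytes are assumptions made up for the demo, not part of the original script; the sample deliberately contains byte 0x8d, which cp1252 cannot decode.

```python
import os
import shutil
import tempfile


def concat_data_files(root_dir, out_path, prefix="data"):
    """Append every file under root_dir whose name starts with prefix
    to out_path, copying raw bytes so nothing is ever decoded."""
    with open(out_path, "ab") as outfh:
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in sorted(filenames):
                if name.startswith(prefix):
                    with open(os.path.join(dirpath, name), "rb") as infh:
                        shutil.copyfileobj(infh, outfh)


# Demo in a throwaway directory (hypothetical file names and content).
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "data1.log"), "wb") as fh:
        fh.write(b"Station ID 7\n\x8d FAIL\n")
    out_path = os.path.join(tmp, "a1.txt")
    concat_data_files(tmp, out_path)
    with open(out_path, "rb") as fh:
        combined = fh.read()

# The bytes were copied untouched, although cp1252 cannot decode them.
try:
    combined.decode("cp1252")
    decodes_as_cp1252 = True
except UnicodeDecodeError:
    decodes_as_cp1252 = False
```

Note that the second half of the script in the question still opens a1.txt in text mode, so the same error could reappear there. Passing an explicit encoding, or errors="replace", to that open() call is one option, assuming the stray bytes carry no information you need.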

Related

Editing UTF-8 text file on Windows

I'm trying to manipulate a text file with song names. I want to clean up the data, by changing all the spaces and tabs into +.
This is the code:
input = open('music.txt', 'r')
out = open("out.txt", "w")
for line in input:
    new_line = line.replace(" ", "+")
    new_line2 = new_line.replace("\t", "+")
    out.write(new_line2)
    #print(new_line2)
input.close()
out.close()
It gives me an error:
Traceback (most recent call last):
  File "music.py", line 3, in <module>
    for line in input:
  File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2126: character maps to <undefined>
As music.txt is saved in UTF-8, I changed the first line to:
input = open('music.txt', 'r', encoding="utf8")
This gives another error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u039b' in position 21: character maps to <undefined>
I tried other things with the out.write() but it didn't work.
This is the raw data of music.txt.
https://pastebin.com/FVsVinqW
I saved it in windows editor as UTF-8 .txt file.
If your system's default encoding is not UTF-8, you will need to explicitly configure it for both the filehandles you open, on legacy versions of Python 3 on Windows.
with open('music.txt', 'r', encoding='utf-8') as infh,\
        open("out.txt", "w", encoding='utf-8') as outfh:
    for line in infh:
        line = line.replace(" ", "+").replace("\t", "+")
        outfh.write(line)
This demonstrates how you can use fewer temporary variables for the replacements; I also refactored to use a with context manager, and renamed the file handle variables to avoid shadowing the built-in input function.
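Going forward, perhaps a better solution would be to upgrade your Python version; my understanding is that newer versions can use UTF-8 by default on Windows, too (UTF-8 mode, enabled with python -X utf8 or PYTHONUTF8=1 since Python 3.7, and planned to become the default in Python 3.15).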
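To see why both file handles need an explicit encoding, here is a small round-trip sketch. The sample line and the helper name plus_separated are assumptions for illustration; the character U+039B is the one named in the UnicodeEncodeError from the question.

```python
import os
import tempfile


def plus_separated(line):
    """Replace every space and tab in line with '+'."""
    return line.replace(" ", "+").replace("\t", "+")


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "music.txt")
    dst = os.path.join(tmp, "out.txt")

    # Hypothetical sample data containing U+039B (GREEK CAPITAL LETTER LAMDA).
    with open(src, "w", encoding="utf-8") as fh:
        fh.write("\u039bambda Song\tArtist Name\n")

    # Explicit UTF-8 on both sides, so the platform default is never used.
    with open(src, "r", encoding="utf-8") as infh, \
            open(dst, "w", encoding="utf-8") as outfh:
        for line in infh:
            outfh.write(plus_separated(line))

    with open(dst, "r", encoding="utf-8") as fh:
        result = fh.read()

# cp1252 has no mapping for U+039B, which is what the second error reported.
try:
    "\u039b".encode("cp1252")
    cp1252_can_encode = True
except UnicodeEncodeError:
    cp1252_can_encode = False
```

Only the read side was fixed in the question, so the write side still fell back to cp1252 and failed on the first character that codec cannot represent.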

Read a large text file and write to another file with Python

I am trying to convert a large text file (size of 5 GB+) but got a MemoryError.
From this post, I managed to convert encoding format of a text file into a format that is readable with this:
path ='path/to/file'
des_path = 'path/to/store/file'
for filename in os.listdir(path):
    with open('{}/{}'.format(path, filename), 'r+', encoding='iso-8859-11') as f:
        t = open('{}/{}'.format(des_path, filename), 'w')
        string = f.read()
        t.write(string)
        t.close()
The problem is that when I tried to convert a large text file (5 GB+), I got this error:
Traceback (most recent call last):
  File "Desktop/convertfile.py", line 12, in <module>
    string = f.read()
  File "/usr/lib/python3.6/encodings/iso8859_11.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
MemoryError
I understand that it cannot read a file this large in one go, and I found from several links that I can do it by reading line by line.
So, how can I change my code to read line by line? My understanding of reading line by line is that I need to read a line from f and write it to t until the end of the file, right?
You can iterate over the lines of an open file.
for filename in os.listdir(path):
    inp, out = open_files(filename)
    for line in inp:
        out.write(line)
    inp.close(), out.close()
Note that I've hidden the complexity of the different paths, encodings and modes in a function that I suggest you actually write...
Regarding buffering, i.e. reading/writing larger chunks of the text: Python does its own buffering under the hood, so this shouldn't be too slow compared with a more complex solution.
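The open_files helper is left unwritten above; one possible shape is sketched below. The extra parameters (src_dir, dst_dir, src_encoding, dst_encoding), the UTF-8 output encoding and the convert_directory wrapper are all assumptions, since the answer does not specify them.

```python
import os
import tempfile


def open_files(filename, src_dir, dst_dir,
               src_encoding="iso-8859-11", dst_encoding="utf-8"):
    """Open the source file for reading and the destination for writing.

    newline="" disables newline translation, so line endings pass through
    exactly as they appear in the source file.
    """
    inp = open(os.path.join(src_dir, filename), "r",
               encoding=src_encoding, newline="")
    try:
        out = open(os.path.join(dst_dir, filename), "w",
                   encoding=dst_encoding, newline="")
    except Exception:
        inp.close()  # do not leak the input handle if the output fails
        raise
    return inp, out


def convert_directory(src_dir, dst_dir):
    """Re-encode every file in src_dir into dst_dir, one line at a time."""
    for filename in os.listdir(src_dir):
        inp, out = open_files(filename, src_dir, dst_dir)
        with inp, out:  # both files are closed even if a write fails
            for line in inp:
                out.write(line)


# Demo with two throwaway directories and a tiny ISO-8859-11 (Thai) file.
with tempfile.TemporaryDirectory() as src_dir, \
        tempfile.TemporaryDirectory() as dst_dir:
    with open(os.path.join(src_dir, "thai.txt"), "wb") as fh:
        fh.write(b"\xa1\xa2\n\xa1\n")
    convert_directory(src_dir, dst_dir)
    with open(os.path.join(dst_dir, "thai.txt"), "rb") as fh:
        converted = fh.read()
```

Only one line is held in memory at a time, so memory use depends on the longest line rather than on the file size. A file with no line breaks at all would still be read in one piece; in that case, repeated inp.read(n) calls with a fixed n would be the alternative.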

PdfFileWriter doesn't work because of content including Chinese characters

I was trying to write code that generates a combined PDF from a bunch of small PDF files, but the script fails with a UnicodeEncodeError.
I also tried passing an encoding parameter:
with open("Combined.pdf", "w",encoding='utf-8-sig') as outputStream:
but Python said the file needs to be opened in binary ('wb') mode, so this isn't working.
Below is the code:
writer = PdfFileWriter()
input_stream = []
for f2 in f_re:
    inputf_file = str(mypath+'\\'+f2[2])
    input_stream.append(open(inputf_file,'rb'))
for reader in map(PdfFileReader, input_stream):
    for n in range(reader.getNumPages()):
        writer.addPage(reader.getPage(n))
with open("Combined.pdf", "wb") as outputStream:
    writer.write(outputStream)
writer.save()
for f in input_stream:
    f.close()
Below is error message:
Traceback (most recent call last):
  File "\Workspace\Python\py_CombinPDF\py_combinePDF.py", line 89, in <module>
    writer.write(outputStream)
  File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\pdf.py", line 501, in write
    obj.writeToStream(stream, key)
  File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\generic.py", line 549, in writeToStream
    value.writeToStream(stream, encryption_key)
  File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\generic.py", line 472, in writeToStream
    stream.write(b_(self))
  File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\utils.py", line 238, in b_
    r = s.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 8-9: ordinal not in range(256)
Upgrading PyPDF2 solved this issue.
Now, 4 years later, people should use pypdf. It contains the latest code (I'm the maintainer of PyPDF2 and pypdf)

How to convert binary file into readable format on linux server

I am trying to convert binary file into readable format but unable to do so, please suggest how it could be achieved.
$ file test.docx
test.docx: Microsoft Word 2007+
$ file -i test.docx
test.docx: application/msword; charset=binary
$
>>> raw = codecs.open('test.docx', encoding='ascii').readlines()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/Python/installPath/lib/python2.7/codecs.py", line 694, in readlines
    return self.reader.readlines(sizehint)
  File "/home/Python/installPath/lib/python2.7/codecs.py", line 603, in readlines
    data = self.read()
  File "/home/Python/installPath/lib/python2.7/codecs.py", line 492, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 18: ordinal not in range(128)
Try the code below, which works with binary data:
with open("test_file.docx", "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
    print(data)
    # Seek position and read N bytes
    binary_file.seek(0) # Go to beginning
    couple_bytes = binary_file.read(2)
    print(couple_bytes)
You'll have to read it in binary mode:
import binascii
with open('test.docx', 'rb') as f: # 'rb' stands for read binary
    hexdata = binascii.hexlify(f.read()) # convert to hex
    print(hexdata)
I think the other answers do not address the whole question, at least not the part @ankitpandey clarified in his comment about catdoc returning an error:
"catdoc then error is This file looks like ZIP archive or Office 2007 or later file. Not supported by catdoc"
I had just run into the same issue with catdoc and found a solution that worked for me. The mention of a ZIP archive was the clue: I was able to use the following command
unzip -q -c 'test.docx' word/document.xml | python etree.py
to extract the text portion of test.docx to stdout.
The Python code was placed in etree.py:
from lxml import etree
import sys
xml = sys.stdin.read().encode('utf-8')
root = etree.fromstring(xml)
bits_of_text = root.xpath('//text()')
# print(bits_of_text) # Note that some bits are whitespace-only
joined_text = ' '.join(
    bit.strip() for bit in bits_of_text
    if bit.strip() != ''
)
print(joined_text)
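Since a .docx file is a ZIP archive, the same extraction can also be sketched with the standard library alone, without unzip or lxml. This is a swapped-in variant, not the answer's method: the function name docx_text is an assumption, and it collects only the w:t (text run) elements rather than every text node as the //text() query above does. The demo builds a tiny stand-in archive in memory instead of using a real Word file.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(docx_file):
    """Return the text runs of a .docx file (path or file object),
    joined with single spaces."""
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    bits = (node.text for node in root.iter("{%s}t" % W_NS))
    return " ".join(bit.strip() for bit in bits if bit and bit.strip())


# Demo: a minimal stand-in for a .docx, holding just the one member we read.
sample_xml = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>world</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)
buffer.seek(0)

extracted = docx_text(buffer)
print(extracted)
```

With a real file, the call would be docx_text('test.docx'). Joining with spaces loses paragraph boundaries, just as the lxml version does.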

I get python frameworks error while reading a csv file, when I try a different easier file it works fine

import csv
exampleFile = open('example.csv')
exampleReader = csv.reader(exampleFile)
for row in exampleReader:
    print('Row #' + str(exampleReader.line_num) + ' ' + str(row))
Traceback (most recent call last):
  File "/Users/jossan113/Documents/Python II/test.py", line 7, in <module>
    for row in exampleReader:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 4627: ordinal not in range(128)
Does anyone have any idea why I get this error? I tried a very simple CSV file from the internet and it worked just fine, but when I try the bigger file it doesn't.
The file contains Unicode characters, which were painful to deal with in old versions of Python. Since you are using 3.5, try opening the file as UTF-8 and see if the issue goes away:
exampleFile = open('example.csv', encoding="utf-8")
From the docs:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
csv module docs
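If the file turns out not to be UTF-8 either, one hedged option is to try a short list of candidate encodings in order. The sketch below is an extension of the advice above, not something from the docs: the function name read_csv_rows and the fallback to cp1252 are assumptions, and the right fallback depends on where the file actually came from.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encodings=("utf-8", "cp1252")):
    """Return (rows, encoding) using the first encoding that decodes the
    whole file. The file is read completely before returning, because
    decode errors only surface while the reader is being iterated."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            with open(path, newline="", encoding=encoding) as f:
                return list(csv.reader(f)), encoding
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error


# Demo with two throwaway files (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    utf8_path = os.path.join(tmp, "utf8.csv")
    with open(utf8_path, "wb") as fh:
        fh.write("name,city\nJos\u00e9,M\u00e1laga\n".encode("utf-8"))

    legacy_path = os.path.join(tmp, "legacy.csv")
    with open(legacy_path, "wb") as fh:
        fh.write(b"name\nJos\xe9\n")  # 0xe9 is not valid UTF-8 here

    utf8_rows, utf8_encoding = read_csv_rows(utf8_path)
    legacy_rows, legacy_encoding = read_csv_rows(legacy_path)
```

A fallback that decodes without error is not guaranteed to be the correct encoding; it only means every byte has some mapping. Checking a few decoded rows by eye is still worthwhile.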
