UnicodeDecodeError parsing file with for loop in Python 3

I get a UnicodeDecodeError when I loop over the lines of a file.
with open(somefile, 'r') as f:
    for line in f:
        pass  # do something
This happens when I use Python 3.4.
In general, I have some files that contain non-UTF-8 bytes. I want to parse each file line by line, find the lines where the problem appears, and get the exact index within the line where the non-UTF-8 byte occurs. I already have code for this that works under Python 2.7.9, but under Python 3.4 I get a UnicodeDecodeError as soon as the for loop executes.
Any ideas?

You need to open the file in binary mode and decode the lines one at a time. Try this:
with open('badutf.txt', 'rb') as f:
    for i, line in enumerate(f, 1):
        try:
            line.decode('utf-8')
        except UnicodeDecodeError as e:
            print('Line: {}, Offset: {}, {}'.format(i, e.start, e.reason))
Here is the result I get in Python3:
Line: 16, Offset: 6, invalid start byte
Sure enough, line 16, position 6 is the bad byte.
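The loop above reports only the first bad byte on each line. If a line can contain several, one possible extension is to keep decoding from just past each failure. This is a minimal sketch, not part of the original answer; the helper name invalid_utf8_offsets and the sample bytes are made up for illustration.

```python
def invalid_utf8_offsets(raw_line):
    """Return the byte offset of every undecodable sequence in raw_line."""
    offsets = []
    pos = 0
    while pos < len(raw_line):
        try:
            raw_line[pos:].decode('utf-8')
            break  # the rest of the line is valid UTF-8
        except UnicodeDecodeError as e:
            # e.start and e.end are relative to the slice, so add pos back.
            offsets.append(pos + e.start)
            pos += e.end  # resume right after the bad sequence
    return offsets


# Hypothetical sample line with two invalid bytes (0xFF and 0xFE).
sample = b'ab\xffcd\xfeef\n'
print(invalid_utf8_offsets(sample))
```

For the sample bytes this prints [2, 5]. Inside the enumerate loop you would call the helper on each line read in 'rb' mode.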

Related

Editing UTF-8 text file on Windows

I'm trying to manipulate a text file with song names. I want to clean up the data, by changing all the spaces and tabs into +.
This is the code:
input = open('music.txt', 'r')
out = open("out.txt", "w")
for line in input:
    new_line = line.replace(" ", "+")
    new_line2 = new_line.replace("\t", "+")
    out.write(new_line2)
    #print(new_line2)
input.close()
out.close()
It gives me an error:
Traceback (most recent call last):
  File "music.py", line 3, in <module>
    for line in input:
  File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2126: character maps to <undefined>
As music.txt is saved in UTF-8, I changed the first line to:
input = open('music.txt', 'r', encoding="utf8")
This gives another error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u039b' in position 21: character maps to <undefined>
I tried other things with the out.write() but it didn't work.
This is the raw data of music.txt.
https://pastebin.com/FVsVinqW
I saved it in the Windows editor as a UTF-8 .txt file.

If your system's default encoding is not UTF-8, you need to specify the encoding explicitly for both file handles you open. This is the case on legacy versions of Python 3 on Windows.
with open('music.txt', 'r', encoding='utf-8') as infh,\
        open("out.txt", "w", encoding='utf-8') as outfh:
    for line in infh:
        line = line.replace(" ", "+").replace("\t", "+")
        outfh.write(line)
This demonstrates how you can use fewer temporary variables for the replacements; I also refactored to use a with context manager, and renamed the file handle variables to avoid shadowing the built-in input function.
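Going forward, perhaps a better solution would be to upgrade your Python version and enable UTF-8 mode (the -X utf8 option or the PYTHONUTF8=1 environment variable, available since Python 3.7); my understanding is that this makes UTF-8 the default for open() on Windows, too.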
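To see why both errors in the question came from the cp1252 default rather than from the file itself, it can help to reproduce them in isolation. The following is a small sketch, assuming only the byte 0x81 and the character '\u039b' taken from the two error messages; the variable names are made up for illustration.

```python
import locale

# The encoding open() falls back to when no encoding= argument is given.
print(locale.getpreferredencoding(False))

# Reading: byte 0x81 has no meaning in cp1252, so decoding fails.
try:
    b'\x81'.decode('cp1252')
except UnicodeDecodeError as e:
    read_error = e.reason

# Writing: the Greek capital lambda has no cp1252 representation.
try:
    '\u039b'.encode('cp1252')
except UnicodeEncodeError as e:
    write_error = e.reason

# UTF-8 can represent the character, which is why encoding='utf-8' helps.
utf8_bytes = '\u039b'.encode('utf-8')
print(read_error, '|', write_error, '|', utf8_bytes)
```

The first failure corresponds to the input handle and the second to the output handle, which is why both open() calls need encoding='utf-8'.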

How to convert binary file into readable format on linux server

I am trying to convert a binary file into a readable format but am unable to do so. Please suggest how this could be achieved.
$ file test.docx
test.docx: Microsoft Word 2007+
$ file -i test.docx
test.docx: application/msword; charset=binary
$
>>> raw = codecs.open('test.docx', encoding='ascii').readlines()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/Python/installPath/lib/python2.7/codecs.py", line 694, in readlines
    return self.reader.readlines(sizehint)
  File "/home/Python/installPath/lib/python2.7/codecs.py", line 603, in readlines
    data = self.read()
  File "/home/Python/installPath/lib/python2.7/codecs.py", line 492, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 18: ordinal not in range(128)

Try the code below, which works with the file as binary data:
with open("test_file.docx", "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
    print(data)
    # Seek position and read N bytes
    binary_file.seek(0)  # Go to beginning
    couple_bytes = binary_file.read(2)
    print(couple_bytes)

You'll have to read it in binary mode:
import binascii
with open('test.docx', 'rb') as f:  # 'rb' stands for read binary
    hexdata = binascii.hexlify(f.read())  # convert to hex
    print(hexdata)
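Reading the first few bytes in binary mode is also a cheap way to identify what kind of file you have. As a rough sketch: ZIP archives start with the bytes PK, and a .docx file is a ZIP archive. The example below builds a tiny archive in memory so that it runs without test.docx; the member name hello.txt is made up for illustration.

```python
import binascii
import io
import zipfile

# Build a tiny ZIP archive in memory; 'hello.txt' is a made-up member name.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('hello.txt', 'hello')

first_bytes = buffer.getvalue()[:4]
print(first_bytes)                    # raw signature bytes
print(binascii.hexlify(first_bytes))  # the same bytes as hex
looks_like_zip = first_bytes[:2] == b'PK'
print(looks_like_zip)
```

On a real file, the equivalent check is to open it with 'rb', read(4), and compare the result in the same way.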

I don't think the other answers addressed this question, at least not the part @ankitpandey clarified in his comment about catdoc returning an error:
"catdoc then error is This file looks like ZIP archive or Office 2007 or later file. Not supported by catdoc"
I had just run into the same issue with catdoc and found a solution that worked for me. The mention of a ZIP archive was the clue: I was able to use the following command
unzip -q -c 'test.docx' word/document.xml | python etree.py
to extract the text portion of test.docx to stdout. The Python code was placed in etree.py:
from lxml import etree
import sys
xml = sys.stdin.read().encode('utf-8')
root = etree.fromstring(xml)
bits_of_text = root.xpath('//text()')
# print(bits_of_text) # Note that some bits are whitespace-only
joined_text = ' '.join(
    bit.strip() for bit in bits_of_text
    if bit.strip() != ''
)
print(joined_text)
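The same idea can also be sketched with only the standard library, using zipfile instead of the unzip command and xml.etree.ElementTree instead of lxml. This is an alternative to the answer's approach, not a description of it. The function name docx_text is hypothetical, and the in-memory document is a stripped-down stand-in for a real .docx so that the example runs on its own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def docx_text(docx_file):
    """Return the text of word/document.xml, joined with single spaces."""
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read('word/document.xml')
    root = ET.fromstring(xml_bytes)
    bits = (bit.strip() for bit in root.itertext())
    return ' '.join(bit for bit in bits if bit)


# Stand-in for a real .docx: a ZIP holding a minimal word/document.xml.
W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
document_xml = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>world</w:t></w:r></w:p>'
    '</w:body></w:document>' % W
)
fake_docx = io.BytesIO()
with zipfile.ZipFile(fake_docx, 'w') as archive:
    archive.writestr('word/document.xml', document_xml)
fake_docx.seek(0)

extracted = docx_text(fake_docx)
print(extracted)
```

With a real file you would pass the path instead, for example docx_text('test.docx'), since zipfile.ZipFile accepts either a path or a file object.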

I get python frameworks error while reading a csv file, when I try a different easier file it works fine

import csv
exampleFile = open('example.csv')
exampleReader = csv.reader(exampleFile)
for row in exampleReader:
    print('Row #' + str(exampleReader.line_num) + ' ' + str(row))
Traceback (most recent call last):
  File "/Users/jossan113/Documents/Python II/test.py", line 7, in <module>
    for row in exampleReader:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 4627: ordinal not in range(128)
Does anyone have any idea why I get this error? I tried a very simple CSV file from the internet and it worked just fine, but the bigger file doesn't.

The file contains non-ASCII characters, which were painful to deal with in old versions of Python. Since you are using 3.5, try opening the file as UTF-8 and see if the issue goes away:
exampleFile = open('example.csv', encoding="utf-8")
From the docs:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
csv module docs
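If UTF-8 turns out not to be the file's real encoding, one option is to try a short list of candidates in order. This is only a sketch: the function name read_csv_rows and the fallback choice of cp1252 are assumptions, not something the error message tells you, and a wrong guess that happens to decode will silently produce wrong characters.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encodings=('utf-8', 'cp1252')):
    """Return (rows, encoding) using the first encoding that decodes the file."""
    for encoding in encodings:
        try:
            with open(path, newline='', encoding=encoding) as f:
                # list() forces the whole file to be decoded inside the try.
                return list(csv.reader(f)), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode ' + path)


# Demo file written as cp1252 bytes: 0xE9 is e-acute there, invalid as UTF-8.
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
    tmp.write(b'name,price\ncaf\xe9,1\n')

rows, used_encoding = read_csv_rows(tmp.name)
os.remove(tmp.name)
print(used_encoding, rows)
```

Listing 'utf-8' first matters, because single-byte encodings such as cp1252 accept almost any byte sequence and would otherwise mask a valid UTF-8 file.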

How to load a formatted txt file into Python to be searched

I have a file that is formatted with varying indentation and is several hundred lines long. I have tried various methods to load it into Python as a file and as a variable, but have not been successful. What would be an efficient way to load the file? My end goal is to load the file and search it for a specific line of text.
with open('''C:\Users\Samuel\Desktop\raw.txt''') as f:
    for line in f:
        if line == 'media_url':
            print line
        else:
            print "void"
Error:
Traceback (most recent call last):
  File "<pyshell#35>", line 1, in <module>
    with open('''C:\Users\Samuel\Desktop\raw''') as f:
IOError: [Errno 22] invalid mode ('r') or filename: 'C:\\Users\\Samuel\\Desktop\raw

If you're trying to search for a specific line, it's much better to avoid loading the whole file into memory:
with open('filename.txt') as f:
    for line in f:
        if line == 'search string':  # or perhaps: if 'search string' in line:
            pass  # do something
If you're trying to search for the presence of a specific line while ignoring indentation, you'll want to use
if line.strip() == 'search string'.strip():
in order to strip off the leading (and trailing) whitespace before comparing.
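Put together, the line-by-line search with stripping might look like the sketch below. It is written for Python 3; the function name find_line_numbers and the sample file contents are made up for illustration.

```python
import os
import tempfile


def find_line_numbers(path, target):
    """Return the 1-based numbers of lines equal to target, ignoring indentation."""
    target = target.strip()
    matches = []
    with open(path) as f:
        for line_number, line in enumerate(f, 1):
            # strip() removes the indentation and the trailing newline.
            if line.strip() == target:
                matches.append(line_number)
    return matches


# Demo on a throwaway file with mixed indentation.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('id\n    media_url\n\tmedia_url_https\n  media_url\n')

found = find_line_numbers(tmp.name, 'media_url')
os.remove(tmp.name)
print(found)
```

For the sample file this prints [2, 4]. The line media_url_https is not reported because the comparison is for equality; use "target in line" instead if substring matches are wanted.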

The following is the standard way of reading a file's contents into a variable:
with open("filename.txt", "r") as f:
    contents = f.read()
Use the following if you want a list of lines instead of the whole file in a string:
with open("filename.txt", "r") as f:
    contents = f.readlines()
You can then search for text with
if any("search string" in line for line in contents):
    print 'line found'

Python uses the backslash as an escape character. For Windows paths, this means you should write the path as a raw string, with an r prefix.
Lines also have their newlines attached; to compare them, strip the newline first.
with open(r'C:\Users\Samuel\Desktop\raw.txt') as f:
    for line in f:
        if line.rstrip() == 'media_url':
            print line
        else:
            print "void"
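The effect of the r prefix is easy to check in isolation. In this short sketch (Python 3, with a shortened made-up path fragment), the \r in the plain string turns into a carriage return, which matches the mangled filename in the IOError above. It also shows that rstrip() keeps leading indentation, so strip() may be the safer choice for an indented file.

```python
plain = 'Desktop\raw.txt'      # \r is read as a carriage return
raw = r'Desktop\raw.txt'       # raw string: the backslash stays a backslash
escaped = 'Desktop\\raw.txt'   # doubling the backslash works too

print(repr(plain))
print(repr(raw))
has_carriage_return = '\r' in plain

# rstrip() removes only trailing whitespace; strip() removes both ends.
line = '    media_url\n'
print(repr(line.rstrip()))
print(repr(line.strip()))
```

The raw and escaped spellings produce the same string; only the plain one is one character shorter, because backslash and r collapsed into a single control character.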

Error: ValueError: invalid literal for int() with base 10: ''

I have the value 1028 in the file build_ver.txt and get the error below when running the following script. The script tries to increment the count by 1 and write the value back to the file. Please suggest how to overcome this.
with open(r'\\Network\Build_ver\build_ver.txt','w+') as f:
    value = int(f.read())
    f.seek(0)
    f.write(str(value + 1))
Error:
Traceback (most recent call last):
  File "build_ver.py", line 2, in <module>
    value = int(f.read())
ValueError: invalid literal for int() with base 10: ''

This is what opening a file in w+ mode does:
w+
    Open for reading and writing. The file is created if it does not exist, otherwise it is truncated. The stream is positioned at the beginning of the file.
Note the part "otherwise it is truncated": by the time you call read(), your file is already empty, so read() gives you an empty string.
Perhaps you want to open in r+ mode?
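A minimal sketch of the r+ variant follows, run against a temporary file rather than the network path from the question. The function name increment_counter is made up, and the truncate() call is an extra precaution not mentioned above: it removes any leftover characters if the new text is shorter than the old contents.

```python
import os
import tempfile


def increment_counter(path):
    """Read an integer from path, add 1, and write it back in place."""
    with open(path, 'r+') as f:   # r+ opens for read and write, no truncation
        value = int(f.read())     # int() tolerates a trailing newline
        f.seek(0)                 # go back to the start before writing
        f.write(str(value + 1))
        f.truncate()              # drop anything left over from the old text
    return value + 1


# Demo on a throwaway file holding the value from the question.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('1028\n')

new_value = increment_counter(tmp.name)
with open(tmp.name) as f:
    stored = f.read()
os.remove(tmp.name)
print(new_value, repr(stored))
```

Unlike w+, r+ raises an error if the file does not exist, so the counter file has to be created once before the script first runs.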

You can also use fileinput to modify the file "in-place":
import fileinput
for line in fileinput.input(r'\\Network\Build_ver\build_ver.txt', inplace=True):
    print(str(int(line) + 1))
Everything printed inside the loop is written back to the file.
