Encoding issues reading a Word document as text in Python - python

I have a huge Word document with more than 10,000 lines in it; it contains random empty lines and also weird characters. I want to save it as a .txt or a .fasta file, read each line as a string, and run it through my program to pull out only the FASTA headers and their sequences.
I have searched online, and all of the posts about encoding issues just make it more confusing for me.
So far I have tried:
1) Saved the Word document as a .txt file with the Unicode (UTF-8) option and ran my code below; about 1,000 lines were output before it hit an error.
with open('TemplateMaster2.txt', encoding='utf-8') as fin, open('OnlyFastaseq.fasta', 'w') as fout:
    for line in fin:
        if line.startswith('>'):
            fout.write(line)
            fout.write(next(fin))
Error message:
UnicodeEncodeError: 'charmap' codec can't encode character '\uf044' in position 11: character maps to <undefined>
2) Saved the Word document as a .txt file with the Unicode (UTF-8) option; about 1,000 lines were output before it hit a different error.
with open('TemplateMaster2.txt') as fin, open('OnlyFastaseq.fasta', 'w') as fout:
    for line in fin:
        if line.startswith('>'):
            fout.write(line)
            fout.write(next(fin))
Error message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 5664: character maps to <undefined>
I can try different options for saving the Word document as a .txt file, but there are too many, and I am not sure what the problem really is. Should I save it with the option 'Unicode', 'Unicode (Big-Endian)', 'Unicode (UTF-7)', 'Unicode (UTF-8)', 'US-ASCII', etc.?

The only thing that seems to be missing is encoding='utf-8' in your open statement for fout.
with open('TemplateMaster2.txt', 'r', encoding='utf-8') as fin, open('OnlyFastaseq.fasta', 'w', encoding='utf-8') as fout:
    for line in fin:
        if line.startswith('>'):
            fout.write(line)
            seq = next(fin)
            fout.write(seq)
Did you double-check that your sequences really are only on one line every time?
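If a sequence can wrap onto several lines, writing only next(fin) after the header drops the rest of it. A minimal sketch that instead collects every line up to the next blank line or header (the helper name extract_fasta and the blank-line convention are assumptions for illustration, not part of the question):

```python
def extract_fasta(src, dst):
    """Copy FASTA headers and their (possibly multi-line) sequences."""
    with open(src, encoding='utf-8') as fin, \
            open(dst, 'w', encoding='utf-8') as fout:
        in_record = False
        for line in fin:
            line = line.strip()
            if line.startswith('>'):
                in_record = True             # a header starts a record
                fout.write(line + '\n')
            elif in_record and line:
                fout.write(line + '\n')      # continuation of the sequence
            elif not line:
                in_record = False            # a blank line ends the record
```

This also silently skips the "random empty lines" mentioned in the question, since blank lines are never written out.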

Related

Script crashes when trying to read a specific Japanese character from file

I was trying to save some Japanese characters from a text file into a string. Most characters, like "道", cause no problems, but others, like "坂", don't work; when I try to read them, my script crashes.
Do I have to use a specific encoding while reading the file?
That's my code btw:
with open(path, 'r') as file:
    lines = [line.rstrip() for line in file]
The error is:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 310: character maps to <undefined>
You have to specify the encoding when working with non-ASCII text, like this:
file = open(filename, encoding="utf8")
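Putting the answer together with the question's loop, a sketch of the corrected read (the function wrapper and path name are illustrative assumptions; the fix itself is just the encoding argument):

```python
def read_lines(path, encoding='utf-8'):
    """Read a text file with an explicit encoding, stripping line endings."""
    with open(path, 'r', encoding=encoding) as file:
        return [line.rstrip() for line in file]
```

If the file's encoding is unknown, errors='replace' can be added to open() to substitute U+FFFD for undecodable bytes instead of crashing.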

Reading Hong Kong Supplementary Character Set in python 3

I have an HKSCS dataset that I am trying to read in Python 3. Below is my code:
encoding = 'big5hkscs'
lines = []
num_errors = 0
for line in open('file.txt'):
    try:
        lines.append(line.decode(encoding))
    except UnicodeDecodeError as e:
        num_errors += 1
It throws the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 0: invalid start byte. It seems there is a non-UTF-8 character in the dataset that the code is not able to decode.
I tried adding errors='ignore' in this line:
lines.append(line.decode(encoding, errors='ignore'))
But that does not solve the problem.
Can anyone please suggest?
If a text file contains text encoded with an encoding that is not the default encoding, the encoding must be specified when opening the file to avoid decoding errors:
encoding = 'big5hkscs'
path = 'file.txt'
with open(path, 'r', encoding=encoding) as f:
    for line in f:
        # do something with line
Alternatively, the file may be opened in binary mode, and text decoded afterwards:
encoding = 'big5hkscs'
path = 'file.txt'
with open(path, 'rb') as f:
    for line in f:
        decoded = line.decode(encoding)
        # do something with decoded text
In the question, the file is opened without specifying an encoding, so its contents are automatically decoded with the default encoding - apparently UTF-8 in this case.
It looks like if I do NOT add the except clause except UnicodeDecodeError as e, it works fine:
encoding = 'big5hkscs'
lines = []
path = 'file.txt'
with open(path, encoding=encoding, errors='ignore') as f:
    for line in f:
        line = '\t' + line
        lines.append(line)
snakecharmerb's answer is correct, but it may need an explanation.
You didn't say so in the original question, but I assume the error occurs on the for line. There the file is being decoded as UTF-8 (probably the default in your environment, so presumably not on Windows), and later you try to decode it again. So the error is not about decoding big5hkscs, but about opening the file as text in the first place.
As in the second part of snakecharmerb's answer, you should open the file in binary mode, without decoding, and then decode each line yourself with line.decode(encoding). (Iterating a binary file does work; it yields lines split on b'\n'.) That way you can still catch the errors, e.g. to print a message. Otherwise the normal approach is to decode at open() time, but then you lose the ability to fall back and give users a better error message (e.g. a line number).
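A sketch of that binary-mode approach with per-line error reporting (the function name decode_lines is made up for illustration; binary iteration splits lines on b'\n'):

```python
def decode_lines(path, encoding='big5hkscs'):
    """Decode a file line by line, reporting the line number on failure."""
    decoded = []
    with open(path, 'rb') as f:                  # binary mode: no implicit decoding
        for lineno, raw in enumerate(f, start=1):
            try:
                decoded.append(raw.decode(encoding))
            except UnicodeDecodeError as e:
                print('line %d: %s' % (lineno, e))   # keep going, but say where
    return decoded
```

This keeps the try/except from the question but moves the decoding to where it actually happens, so the handler can fire.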

codec can't decode byte 0x81

I have this simple bit of code:
file = open(filename, "r", encoding="utf-8")
num_lines = sum(1 for line in open(filename))
I simply want to get the number of lines in the file, but I keep getting this error. I'm thinking of just skipping Python and doing it in C# ;-)
Can anyone help? I added 'utf-8' after searching for the error and reading that it should fix it. The file is just a simple text file, not an image - albeit a large one. It's actually a CSV string, but I just want to get an idea of the number of lines before I start processing it.
Many thanks.
  in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4344:
character maps to <undefined>
It seems to be an encoding problem.
In your example code, you are opening the file twice, and the second open doesn't include the encoding.
Try the following code:
file = open(filename, "r", encoding="utf-8")
num_lines = sum(1 for line in file)
Or (more recent):
with open(filename, "r", encoding="utf-8") as file:
    num_lines = sum(1 for line in file)
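When only a line count is needed, the file doesn't have to be decoded at all: opening it in binary mode sidesteps the UnicodeDecodeError entirely. A small sketch (the helper name is made up):

```python
def count_lines(filename):
    """Count lines without decoding, so odd bytes can't raise errors."""
    with open(filename, 'rb') as f:      # binary mode: lines split on b'\n'
        return sum(1 for _ in f)
```

This is handy for a quick size estimate before committing to an encoding for the real processing pass.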

Python CSV file UTF-16 to UTF-8 print error

There is a number of topics on this problem around the web, but I can not seem to find the answer for my specific case.
I have a CSV file. I am not sure what was done to it, but when I try to open it, I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
Here is a full Traceback:
Traceback (most recent call last):
  File "keywords.py", line 31, in <module>
    main()
  File "keywords.py", line 28, in main
    get_csv(file_full_path)
  File "keywords.py", line 19, in get_csv
    for row in reader:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)
With the help of Stack Overflow, I got it open with:
reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
Now the problem is that when I am reading the file:
def get_csv(file_full_path):
    import csv, codecs
    reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
    for row in reader:
        print row
I get stuck on Asian symbols:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)
I have tried decode, encode, and unicode() on the string containing that character, but it does not seem to help.
for row in reader:
    #decoded_row = [element_s.decode('UTF-8') for element_s in row]
    #print decoded_row
    encoded_row = [element_s.encode('UTF-8') for element_s in row]
    print encoded_row
At this point I do not really understand why. If I
>>> print u'\u5a07'
娇
or
>>> print '娇'
娇
it works. It also works in the terminal. I have checked the default encoding in the terminal and the Python shell; it is UTF-8 everywhere, and it prints that symbol easily. I assume it has something to do with opening the file with codecs using UTF-16.
I am not sure where to go from here. Could anyone help out?
The csv module cannot handle Unicode input. It says so specifically on its documentation page:
Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe;
You need to convert your CSV file to UTF-8 so that the module can deal with it:
with codecs.open(file_full_path, 'rU', 'UTF-16') as infile:
    with open(file_full_path + '.utf8', 'wb') as outfile:
        for line in infile:
            outfile.write(line.encode('utf8'))
Alternatively, you can use the command-line utility iconv to convert the file for you.
Then use that re-coded file to read your data:
reader = csv.reader(open(file_full_path + '.utf8', 'rb'), delimiter='\t', quotechar='"')
for row in reader:
    print [c.decode('utf8') for c in row]
Note that the columns then need decoding to unicode manually.
Encode errors are what you get when you try to convert Unicode characters to 8-bit sequences, so your first error is not raised while actually reading the file, but a bit later.
You probably get this error because the Python 2 csv module expects files opened in binary mode, while you opened yours so that it returns unicode strings.
Change your opening to this:
reader = csv.reader(open(file_full_path, 'rb'), delimiter='\t', quotechar='"')
And you should be fine. Or even better:
with open(file_full_path, 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t', quotechar='"')
    # CSV handling here.
However, you can't use UTF-16 (or UTF-32) directly: the separator characters are two-byte sequences in UTF-16, and the module will not handle that correctly, so you will need to convert the file to UTF-8 first.
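For comparison, in Python 3 the csv module reads text rather than bytes, so a UTF-16 file can be handled without the conversion step. A sketch under that assumption (the function name is made up; newline='' follows the csv documentation's recommendation for files passed to csv.reader):

```python
import csv

def read_utf16_csv(path):
    """Read a tab-separated UTF-16 file into a list of rows (Python 3)."""
    with open(path, encoding='utf-16', newline='') as f:
        return list(csv.reader(f, delimiter='\t', quotechar='"'))
```

The 'utf-16' codec detects the byte order from the BOM and strips it, so no manual re-encoding or column decoding is needed.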

Python: How do I read and parse a unicode utf-8 text file?

I am exporting UTF-8 text from Excel and I want to read and parse the incoming data using Python. I've read all the online info so I've already tried this, for example:
txtFile = codecs.open('halout.txt', 'r', 'utf-8')
for line in txtFile:
    print repr(line)
The error I am getting is:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: unexpected code byte
Looking at the text file in a hex editor, the first bytes are FF FE. I've also tried:
txtFile.seek( 2 )
right after the 'open' but that just causes a different error.
That file is not UTF-8; it's UTF-16LE with a byte-order marker.
That is a BOM.
EDIT: from the comments, it seems to be a UTF-16 BOM.
codecs.open('foo.txt', 'r', 'utf-16')
should work.
Expanding on Johnathan's comment, this code should read the file correctly:
import codecs
txtFile = codecs.open('halout.txt', 'r', 'utf-16')
for line in txtFile:
    print repr(line)
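Rather than guessing between the Unicode save options, the BOM itself can be sniffed to pick the codec. A sketch that checks only the common Unicode BOMs (the helper name is hypothetical); the UTF-32 checks must come before UTF-16, since the UTF-32-LE BOM starts with the same two bytes as the UTF-16-LE one:

```python
import codecs

def sniff_encoding(path, default='utf-8'):
    """Guess a file's encoding from its BOM, falling back to a default."""
    with open(path, 'rb') as f:
        head = f.read(4)
    # Order matters: BOM_UTF32_LE (FF FE 00 00) begins with BOM_UTF16_LE (FF FE).
    for bom, name in [(codecs.BOM_UTF8, 'utf-8-sig'),
                      (codecs.BOM_UTF32_LE, 'utf-32'),
                      (codecs.BOM_UTF32_BE, 'utf-32'),
                      (codecs.BOM_UTF16_LE, 'utf-16'),
                      (codecs.BOM_UTF16_BE, 'utf-16')]:
        if head.startswith(bom):
            return name
    return default
```

The returned names ('utf-8-sig', 'utf-16', 'utf-32') all strip the BOM during decoding, so the first line of text comes out clean.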
Try checking whether the Excel file has some blank rows (and then values again); that might cause the unexpected error.
