Script crashes when trying to read a specific Japanese character from file - python

I was trying to save some Japanese characters from a text file into a string. Most characters, like "道", cause no problems, but others, like "坂", don't work. When I try to read them, my script crashes.
Do I have to use a specific encoding while reading the file?
That's my code btw:
with open(path, 'r') as file:
    lines = [line.rstrip() for line in file]
The error is:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 310: character maps to <undefined>

You have to specify the encoding when working with non-ASCII text, like this:
file = open(filename, encoding="utf8")
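A minimal round trip in the same style as the question's code, assuming the file is (or should be) UTF-8; the path "sample.txt" is just an example name:

```python
# Write Japanese text as UTF-8, then read it back with the same encoding.
path = "sample.txt"

with open(path, "w", encoding="utf-8") as f:
    f.write("道\n坂\n")

with open(path, "r", encoding="utf-8") as f:
    lines = [line.rstrip() for line in f]

print(lines)  # ['道', '坂']
```

Without the encoding argument, open falls back to the platform default (often cp1252 on Windows), which is what produces the 'charmap' error above.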

Related

How to put Arabic text in a txt file

I have been trying to put Arabic text in a .txt file, and when I do so using the code below I get this error: UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>
code:
Log1 = open("File.txt", "a")
Log1.write("سلام")
Log1.close()
This question has been asked on Stack Overflow many times, but all the answers suggest using utf-8, which in this case outputs \xd8\xb3\xd9\x84\xd8\xa7\xd9\x85. I was wondering if there is any way to make the text file show سلام instead of \xd8\xb3\xd9\x84\xd8\xa7\xd9\x85.
This works perfectly:
FILENAME = 'foo.txt'
with open(FILENAME, "w", encoding='utf-8') as data:
    data.write("سلام")
Then from a zsh shell:
cat foo.txt
سلام
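You can also confirm the round trip from Python itself; a small sketch reusing the same file name:

```python
FILENAME = 'foo.txt'

with open(FILENAME, "w", encoding='utf-8') as data:
    data.write("سلام")

# Reading back with the same encoding returns the text itself,
# not the \xd8\xb3... byte escapes (those only appear when you
# look at the raw bytes).
with open(FILENAME, encoding='utf-8') as data:
    text = data.read()

print(text)  # سلام
```

The byte escapes the question mentions are just how a bytes object is displayed; the file on disk contains the correctly encoded Arabic text.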

ERROR: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 715: character maps to <undefined>

I am using a Python notebook to open this text file on Windows 10, but UTF-8 encoding isn't working. How should I solve this error?
Python is trying to open the file using your system's default encoding, but the bytes in the file cannot be decoded with that encoding.
You need to pass the correct encoding to open. We don't know what the correct encoding is, but the most likely candidates are UTF-8 or, less common these days, latin-1. So you would do something like
with open('myfile.txt', 'r', encoding='UTF-8') as f:
    for line in f:
        pass  # do something with the line
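Since the correct encoding is unknown, one option is to try the likely candidates in order; a sketch, with the caveat that latin-1 accepts any byte sequence, so it always "succeeds" even when it is the wrong guess:

```python
def read_lines(path):
    """Try UTF-8 first; fall back to Latin-1, which never raises."""
    for enc in ("utf-8", "latin-1"):
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read().splitlines()
        except UnicodeDecodeError:
            continue

# Example: a file containing the byte 0x8d, which is invalid UTF-8
# but decodes fine as Latin-1.
with open("mystery.txt", "wb") as f:
    f.write(b"abc\x8d\n")

print(read_lines("mystery.txt"))
```

Because Latin-1 maps every byte to a character, a "successful" decode does not prove the guess was right; if the text looks like mojibake, the real encoding was something else.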

Python - Reading CSV UnicodeError

I have exported a CSV from Kaggle - https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis. However, when I attempt to iterate through the file, I receive unicode errors concerning certain characters that cannot be encoded.
File "C:\Program Files\Python35\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 264: character maps to <undefined>
I have enabled utf-8 encoding while opening the file, which I assumed would handle the non-ASCII characters. Evidently not.
My Code:
with open("sentimentDataSet.csv", "r", encoding="utf-8", errors='ignore', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        if row:
            print(row)
            if row[sentimentCsvColumn] == sentimentScores(row[textCsvColumn]):
                accuracyCount += 1
print(accuracyCount)
That's an encode error raised while you're printing the row; it has little to do with reading the actual CSV.
Your Windows terminal uses the CP850 encoding, which can't represent every character.
There are some things you can do here.
A simple way is to set the PYTHONIOENCODING environment variable to a combination that will trash anything it can't represent: running set PYTHONIOENCODING=cp850:replace before starting Python makes it substitute question marks for characters CP850 can't encode.
Change your terminal encoding to UTF-8: chcp 65001 before running Python.
Encode the thing by hand before printing: print(str(data).encode('ascii', 'replace'))
Don't print the thing.
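The "encode by hand" option can be checked without a CP850 terminal; a sketch of what the replacement does (the sample string is made up):

```python
data = "tweet text \u2026 with an ellipsis"

# Round-trip through CP850 with errors='replace': any character the
# terminal's code page can't represent becomes '?'.
safe = data.encode("cp850", "replace").decode("cp850")
print(safe)  # tweet text ? with an ellipsis
```

This is the same substitution PYTHONIOENCODING=cp850:replace performs automatically on everything written to stdout.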

What encoding should I use to open a file containing a capital N with tilde (Ñ)?

I'm trying to open a file containing a capital N with tilde (http://graphemica.com/%C3%91), but I can't figure it out. When I open the file in Notepad++ it shows the character as xD1; when I open it in gedit it shows \D1. When I open the file in Excel, it shows the character correctly.
Now I'm trying to open the file in Python, and it halts when it encounters the character. I'm aware that I can pass an encoding so the file opens properly, but I'm not sure which encoding to use. My error is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 0: invalid continuation byte
this is my code
with codecs.open('tsv.txt', 'r', 'utf8') as my_file:
    for line in my_file:
        print(line)
If it is not utf8, then what should I use? The site above does not show which encoding 0xd1 is associated with.
You can find in tables how 'Ñ' gets encoded in different encodings.
You can also try it directly with Python:
>>> 'Ñ'.encode('utf8')
b'\xc3\x91'
>>> 'Ñ'.encode('latin1')
b'\xd1'
It seems that your file is encoded in latin-1.
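The same check works in the decode direction as well:

```python
raw = b"\xd1"  # the byte seen in the file

# Latin-1 maps byte 0xD1 directly to 'Ñ'.
print(raw.decode("latin-1"))  # Ñ

# As UTF-8, 0xD1 is the lead byte of a two-byte sequence, so a lone
# 0xD1 (or one followed by a non-continuation byte) raises an error.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)
```

So the file can be opened with open('tsv.txt', 'r', encoding='latin-1') instead of utf8.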

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-10: ordinal not in range(128) chinese characters

I'm trying to write Chinese characters into a text file from a SQL output called result.
result looks like this:
[('你好吗', 345re4, '2015-07-20'), ('我很好',45dde2, '2015-07-20').....]
This is my code:
#result is a list of tuples
file = open("my.txt", "w")
for row in result:
    print >> file, row[0].encode('utf-8')
file.close()
row[0] contains Chinese text like this: 你好吗
I also tried:
print >> file, str(row[0]).encode('utf-8')
and
print >> file, 'u'+str(row[0]).encode('utf-8')
but both gave the same error.
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-10: ordinal not in range(128)
Found a simple solution: instead of encoding and decoding by hand, open the file as UTF-8 from the beginning using codecs.
import codecs
file = codecs.open("my.txt", "w", "utf-8")
Don't forget to add the UTF-8 BOM at the beginning of the file if you wish to view it correctly in a text editor:
file = open(...)
file.write("\xef\xbb\xbf")
for row in result:
    print >> file, u"" + row[0].decode("mbcs").encode("utf-8")
file.close()
I think you'll have to decode from your machine's default encoding to unicode(), then encode that as UTF-8.
mbcs represents (at least it did ages ago) the default encoding on Windows.
But do not rely on that.
Did you try the codecs module?
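For reference, the code above is Python 2 (print >> file). In Python 3 neither codecs nor manual encoding is needed; a sketch using sample rows shaped like the question's result:

```python
# Sample rows standing in for the SQL output in the question.
result = [("你好吗", "345re4", "2015-07-20"),
          ("我很好", "45dde2", "2015-07-20")]

# open() takes an encoding directly; str values are written as-is,
# with no encode()/decode() calls needed.
with open("my.txt", "w", encoding="utf-8") as f:
    for row in result:
        f.write(row[0] + "\n")
```

A BOM is also unnecessary for most modern editors, though encoding="utf-8-sig" would add one if a particular Windows tool insists on it.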
