Reading the word "beyoncé" from a text file, Python is handling it as "beyonc\xc3\xa9".
If I write it to a file it shows correctly, but in the console it shows like that.
Also, if I try to use it in my program I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128)
How can I get Python to read "beyoncé" from a text file correctly and get rid of this problem?
See if this helps (on Python 2, use io.open to get the encoding parameter):
f = open('mytextfile.txt', 'w', encoding='utf-8')
f.write(line)
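A fuller sketch of the same idea (Python 3 syntax, hypothetical file name): decode and encode only at the file boundary, and the accented character survives the round trip:

```python
# Write the string with an explicit UTF-8 encoding (hypothetical file name).
with open("mytextfile.txt", "w", encoding="utf-8") as f:
    f.write("beyonc\u00e9")

# Reading it back with the same encoding returns decoded text, not raw bytes.
with open("mytextfile.txt", encoding="utf-8") as f:
    print(f.read())  # beyoncé
```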
Try this:

string = "beyonc\xc3\xa9"          # Python 2 str: raw UTF-8 bytes
decoded = string.decode("utf-8")   # u'beyonc\xe9', a unicode object

foo = open("foo.txt", "wb")
foo.write(string)                  # write the raw UTF-8 bytes back out
foo.close()
I am working with Python 2.7 and NLTK on a large .txt file of content scraped from various websites. However, I am getting various unicode errors such as:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
My question is not so much how I can 'fix' this within Python, but rather: is there anything I can do to the .txt file (as in formatting) before feeding it to Python, such as 'make plain text', to avoid this issue entirely?
Update:
I looked around and found a solution within Python that seems to work perfectly:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Try opening the file with:
f = open(fname, encoding="ascii", errors="surrogateescape")
Replace "ascii" with the desired encoding.
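A quick sketch of what surrogateescape does (Python 3; the byte string here is just an illustration): undecodable bytes become lone surrogates instead of raising, and they round-trip back to the original bytes:

```python
# A byte that is not valid ASCII (0xff) would normally raise UnicodeDecodeError.
raw = b"hello \xff world"

# surrogateescape smuggles the bad byte through as the surrogate U+DCFF.
text = raw.decode("ascii", errors="surrogateescape")
assert "\udcff" in text

# Encoding back with the same handler restores the original bytes exactly.
assert text.encode("ascii", errors="surrogateescape") == raw
```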
I am using the well-known 20 Newsgroups data set for text categorisation in Jupyter. When I try to open and read a file on my Mac, it fails at the decoding step. I tried to read the file in byte format, which works, but I further need to work with it as a string. I tried to decode it, but it fails with the error below.
Code
with open(file_path, 'rb') as f:
    file_read = f.read()
file_read.decode("us-ascii")
Error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 11597: ordinal not in range(128)
us-ascii is the file encoding reported when I type file -I file_name in the terminal. I tried other encodings, but none works.
I further want to remove punctuation and count words in the file. Is there a way how to overcome this issue?
It is tricky without looking at the file. However, this works most of the time:

from codecs import open

file_path = "file_name"
with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
    file_read = f.read()
Setting errors to 'ignore' resolved the problem, thanks N M. The code looks like:

ref_file = open(ref_file_path, 'r', encoding='ascii', errors='ignore')
file_read = ref_file.read()

The code then treats it as one big string. Note that although the error was about decoding 0xff, the file was not UTF-16 encoded.
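A small illustration of what errors='ignore' does at the decode step (Python 3, made-up byte string): undecodable bytes are simply dropped, which is lossy but keeps the rest of the text usable for word counting:

```python
# 0xff is not valid ASCII; errors='ignore' silently discards such bytes.
raw = b"alt.atheism\xff archive"
text = raw.decode("ascii", errors="ignore")
print(text)  # alt.atheism archive
```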
I was trying to read a .txt file in Python. After the following input:
f = open("test.txt","r") #opens file with name of "test.txt"
print(f.read(1))
print(f.read())
instead of seeing the text, this is what I'm returned:

How do I visualize the output? Thanks!
I think you need to go line by line. Be careful: if it's a big text file, this could go on for a while.

for line in f:
    print line.decode('utf-8').strip()

Some strings may not be read correctly, so you'll need the decode line.
See: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)
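For reference, the Python 3 version of this loop needs no per-line decode, because the encoding argument decodes at the file boundary (a sketch with a hypothetical test.txt):

```python
# Create a small UTF-8 file to read (hypothetical contents).
with open("test.txt", "w", encoding="utf-8") as f:
    f.write("caf\u00e9\nline two\n")

# Python 3: each line arrives already decoded; no .decode() needed.
with open("test.txt", encoding="utf-8") as f:
    for line in f:
        print(line.strip())
```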
Try this setup; it worked for my console printing problems:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
I have a problem with writing to a file in unicode. I am using Python 2.7.3. It gives me this error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 1006: character maps to <undefined>
Here is a sample of my code; the error is raised on the line f3.write(text):
f = codecs.open("PopupMessages.strings", encoding='utf-16')
text = f.read()
print text
f.close()
f3 = codecs.open("3.txt", encoding='utf-16', mode='w')
f3.write(text)
f3.close()
I also tried 'utf-8' and 'utf-8-sig', but they didn't help. I have symbols like ['\",;?*&$##%] in my source file, plus symbols in different languages.
How can I solve this issue? Please help; I read the information on Stack Overflow first, but it didn't help me.
Delete this line:

print text

and it should work. The 'charmap' codec is the console's encoding, so the failure comes from printing u'\u2019', not from the UTF-16 file write.
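A sketch of the underlying issue (shown in Python 3 syntax): a limited codec raises on U+2019, while errors='replace' degrades it to '?' instead of crashing, which is one hedged way to keep console output alive:

```python
text = "don\u2019t"  # contains the right single quote U+2019

# A codec that can't represent the character raises UnicodeEncodeError...
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)

# ...while errors='replace' substitutes '?' so printing never crashes.
print(text.encode("ascii", errors="replace").decode("ascii"))  # don?t
```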
I wrote the code below for writing the contents to a file:
with codecs.open(name, "a", "utf8") as myfile:
    myfile.write(str(eachrecord[1]).encode('utf8'))
    myfile.write(" ")
    myfile.write(str(eachrecord[0]).encode('utf8'))
    myfile.write("\n")
The above code does not work properly when writing unicode characters, even though I am using codecs and doing the encoding. I keep getting the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 6: ordinal not in range(128)
Can anyone see where I am going wrong?
Edit:
with codecs.open(name, "a", "utf8") as myfile:
    myfile.write(unicode(eachrecord[1]))
    myfile.write(" ")
    myfile.write(unicode(eachrecord[0]))
    myfile.write("\n")
This worked. Thanks for all the quick comments and answers; that really helps. I did not realize that Python had a unicode built-in until you told me.
Remove the str() calls. They are doing an implicit encoding with the default encoding (ascii in this case).
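The same principle in Python 3 terms (a sketch with made-up values, since Python 3's built-in open takes the encoding directly): hand the file object text and let it do the single encoding step, rather than encoding yourself first:

```python
# Opening with an explicit encoding means we write text and the
# file object performs the UTF-8 encoding itself.
with open("out.txt", "w", encoding="utf-8") as myfile:
    myfile.write("42")
    myfile.write(" ")
    myfile.write("caf\u00e9")  # non-ASCII text, no manual .encode()
    myfile.write("\n")

with open("out.txt", "rb") as f:
    print(f.read())  # b'42 caf\xc3\xa9\n'
```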