Converting ASCII characters in a text file to Unicode - Python

We are creating a website using Django 1.5. We have several large text files stored on the server that are rendered with the web page, depending on the country. The problem is that these text files contain the copyright symbol (c), and we keep getting a 'Non-ascii character' error, so the text does not load. Does anyone have any suggestions on how to successfully convert one to the other?
Selections of the Code:
# Open file, where filename is our variable
with open(filename) as f:
    # Append (it is in a loop, and we are only passing one document variable)
    document = document + f.read()
We have tried using:
mark_safe (in Django)
smart_str
.encode('utf8')
But to no avail; the page continues to spit back an error saying there is an ASCII character that it cannot convert. Any ideas?
Here is the error we keep getting
UnicodeDecodeError at /<website-hidden>/
'ascii' codec can't decode byte 0x92 in position 950: ordinal not in range(128)

The issue is that the copyright symbol isn't a strict ASCII character, as its 8th (most significant) bit is 1. ASCII only uses 7 bits. You need to tell Python that the file isn't ASCII data, but something like "Extended ASCII", "ISO 8859-1" or "ISO Latin-1" data.
As such, you need to read it as bytes and then convert it to a string using that decoding. You can then re-encode it to anything you want, including UTF-8.
Exact handling for this depends on whether you are using Python 2.x or 3.x.
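A minimal sketch of both cases, assuming the files really are Latin-1 (swap in whatever encoding they were actually saved with):
# Python 3: open() decodes for you when given an explicit encoding
with open(filename, encoding='latin-1') as f:
    document = document + f.read()
# Python 2: read the raw bytes, then decode them explicitly
with open(filename, 'rb') as f:
    document = document + f.read().decode('latin-1')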
Ref
http://www.ascii-code.com/
https://en.wikipedia.org/wiki/Extended_ASCII

A restart of the computer and Eclipse seemed to do the trick. Perhaps it was a problem with the cache? Either way, a strange error...


Why do I keep getting a UnicodeDecodeError in the pandas read_csv() function even though I specified the correct encoding parameter? [duplicate]

Why is the item below failing? Why does it succeed with the "latin-1" codec?
o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")
Which results in:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte
I had the same error when I tried to open a CSV file with the pandas.read_csv method.
The solution was to change the encoding to latin-1:
pd.read_csv('ml-100k/u.item', sep='|', names=m_cols , encoding='latin-1')
In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:
>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'
But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin 1. You can see how UTF-8 and latin 1 look different:
>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'
(Note, I'm using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)
It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.
If you don't know the codeset you're receiving strings in, you're in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) were chosen for your protocol/application, and then you'd just reject inputs that didn't decode.
If you can't do that, you'll need heuristics.
Because UTF-8 is multibyte, and there is no character corresponding to your combination of \xe9 plus the following space.
Why should it succeed in both utf-8 and latin-1?
Here is how the same sentence should look in utf-8:
>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'
If this error arises when manipulating a file that was just opened, check whether you opened it in 'rb' mode.
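A small sketch of that idea: opening in binary mode returns bytes, so no implicit decoding happens, and you decode yourself once you know the encoding (the filename is just illustrative):
with open('u.item', 'rb') as f:
    raw = f.read()            # bytes, no decoding attempted yet
text = raw.decode('latin-1')  # decode explicitly with the encoding you expect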
Use this if it shows a UTF-8 error:
pd.read_csv('File_name.csv',encoding='latin-1')
A utf-8 codec error usually occurs when values fall outside the range 0 to 127.
The reason this exception is raised:
1) If the code point is < 128, each byte is the same as the value of the code point.
2) If the code point is 128 or greater, the Unicode string can't be represented in this encoding (Python raises a UnicodeEncodeError exception in this case).
To overcome this we have a set of encodings; the most widely used is Latin-1, also known as ISO-8859-1.
In ISO-8859-1, Unicode code points 0–255 are identical to the Latin-1 byte values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1.
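A small illustration of that 0–255 rule (the euro sign is just an example of a code point above 255):
u'\xe9'.encode('latin-1')    # b'\xe9' - code points up to 255 map to single bytes
u'\u20ac'.encode('latin-1')  # raises UnicodeEncodeError: ordinal not in range(256)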
When this exception occurs while you are trying to load a data set, try this format:
df=pd.read_csv("top50.csv",encoding='ISO-8859-1')
Adding the encoding argument lets the data set load.
This type of error comes up when you are reading a particular file or data set in pandas, such as:
data=pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv')
Then the error is displayed like this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 1: invalid continuation byte
This type of error can be avoided by adding an encoding argument:
data=pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv', encoding='ISO-8859-1')
This happened to me too, while I was reading text containing Hebrew from a .txt file.
I clicked File -> Save As and saved the file with UTF-8 encoding.
TLDR: I would recommend investigating the source of the problem in depth before switching encoders to silence the error.
I got this error as I was processing a large number of zip files with additional zip files in them.
My workflow was the following:
Read zip
Read child zip
Read text from child zip
At some point I was hitting the encoding error above. Upon closer inspection, it turned out that some child zips erroneously contained further zips. Reading these zips as text led to some funky character representation that I could silence with encoding="latin-1", but which in turn caused issues further down the line. Since I was working with international data it was not completely foolish to assume it was an encoding problem (I had problems with 0xc2: Â), but in the end it was not the actual issue.
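For reference, a sketch of that workflow with a guard that skips members which are themselves zip archives instead of decoding them as text (the function and variable names are illustrative):
import io
import zipfile

def read_child_texts(outer_path):
    """Read the text members of the zips nested inside an outer zip."""
    texts = []
    with zipfile.ZipFile(outer_path) as outer:
        for child_name in outer.namelist():
            child_buf = io.BytesIO(outer.read(child_name))
            with zipfile.ZipFile(child_buf) as child:
                for member in child.namelist():
                    data = child.read(member)
                    # Some members were erroneously zips themselves; decoding
                    # those as text is what triggered the UnicodeDecodeError.
                    if zipfile.is_zipfile(io.BytesIO(data)):
                        continue
                    texts.append(data.decode('utf-8'))
    return texts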
In my case, I tried to execute a .py which ran a path/file.sql.
My solution was to change the encoding of the file.sql to "UTF-8 without BOM", and it works!
You can do it with Notepad++.
I will leave part of my code.
con = psycopg2.connect(host=sys.argv[1], port=sys.argv[2], dbname=sys.argv[3],
                       user=sys.argv[4], password=sys.argv[5])
cursor = con.cursor()
sqlfile = open(path, 'r')
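An alternative that avoids re-saving the file in an editor would be to open it with an explicit encoding; a minimal Python 3 sketch reusing path and cursor from the snippet above (the 'utf-8-sig' codec also strips a BOM if one is present):
with open(path, encoding='utf-8-sig') as sqlfile:
    cursor.execute(sqlfile.read())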
I encountered this problem, and it turned out that I had saved my CSV directly from a Google Sheets file. In other words, I was in a Google Sheets file, chose "Save a copy", and when my browser downloaded it I chose Open and then saved the CSV directly. This was the wrong move.
What fixed it for me was first saving the sheet as an .xlsx file on my local computer, and from there exporting the single sheet as .csv. Then the error went away for pd.read_csv('myfile.csv').
The solution was to change to "UTF-8 without BOM".

python codecs can't encode to cp1252...but notepad++ can?

I have a very simple piece of code that converts a CSV. Also note that I reference Notepad++ a few times, but my standard IDE is VS Code.
import codecs

with codecs.open(filePath, "r", encoding="UTF-8") as sourcefile:
    lines = sourcefile.read()
with codecs.open(filePath, 'w', encoding='cp1252') as targetfile:
    targetfile.write(lines)
Now, the job I'm doing requires a specific file to be encoded as windows-1252, and from what I understand cp1252 = windows-1252. This conversion works fine when I do it using the UI features in Notepad++, but when I try using Python codecs to encode the file it fails:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 561488: character maps to <undefined>
When I saw this failure I was confused, so I double-checked the output from when I manually converted the file using Notepad++, and the converted file is encoded in windows-1252. So what gives? Why is a UI feature in Notepad++ able to do the job when codecs seems not to be? Does Notepad++ just ignore errors?
Looks like your input text has the character "�" (the actual "replacement character" placeholder, not some other undefined character), which cannot be mapped to cp1252 (which has no such character).
Depending on what you need, you can:
Filter it out (or replace it, or otherwise handle it) in Python before writing out lines to the output file.
Pass errors=... to the second codecs.open, choosing one of the other error-handling modes; the default is 'strict', and you can also use 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' or 'namereplace' (see the sketch after this list).
Check the input file and see why it's got the "�" character; is it corrupted?
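A minimal sketch of the second option, reusing the codecs.open call and the filePath/lines variables from the question (everything else unchanged):
import codecs

# errors='replace' turns unmappable characters such as U+FFFD into '?'
with codecs.open(filePath, 'w', encoding='cp1252', errors='replace') as targetfile:
    targetfile.write(lines)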
Probably Python is simply more explicit in its error handling. If Notepad++ managed to represent every character correctly in CP-1252 then there is a bug in the Python codec where it should not fail where it currently does; but I'm guessing Notepad++ is silently replacing some characters with some other characters, and falsely claiming success.
Maybe try converting the result back to UTF-8 and compare the files byte by byte if the data is not easy to inspect manually.
Unicode U+FFFD is a reserved character which serves as a placeholder for a character that could not be represented or decoded; often it's an indication of an earlier conversion problem, when this data was presumably imperfectly input or converted at an earlier point in time.
(And yes, Windows-1252 is another name for Windows code page 1252.)
Why notepad++ "succeeds"
Notepad++ does not offer to convert your file to cp1252, but to reinterpret it using this encoding. What led to your confusion is that the program actually uses the wrong term for this in its Encoding menu.
When "Encode with cp1252" is selected, Notepad decodes the file using cp1252 and shows you the result. If you save the character '\ufffd' to a file using utf8:
with open('f.txt', 'w', encoding='utf8') as f:
    f.write('\ufffd')
and use "Encode with cp1252" you'd see three characters:
That means that Notepad++ does not read the character as utf8 and then write it in cp1252, because then you'd see exactly one character. You could achieve similar results to Notepad++ by reading the file using cp1252:
with open('f.txt', 'r', encoding='cp1252') as f:
    print(f.read())  # Prints ï¿½
Notepad++ actually lets you convert to only five encodings, via the separate "Convert to ..." entries in the same Encoding menu.
What should you do
This character does not exist in the cp1252 encoding, which means you can't convert this file without losing information. Common solutions are to skip such characters or replace them with similar characters that do exist in your encoding (see Python's encoding error handlers).
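A tiny illustration of those two error handlers (the sample string is purely illustrative):
text = 'caf\u00e9 \ufffd'
print(text.encode('cp1252', errors='ignore'))   # b'caf\xe9 '  - the character is dropped
print(text.encode('cp1252', errors='replace'))  # b'caf\xe9 ?' - it becomes '?'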
You are dealing with the "utf-8-sig" encoding -- please specify this one as the encoding argument instead of "utf-8".
There is information on it in the docs (search the page for "utf-8-sig").
To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. [...]
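A quick sketch of the difference (the byte string is illustrative):
data = b'\xef\xbb\xbfhello'      # a UTF-8 encoded BOM followed by "hello"
print(data.decode('utf-8'))      # '\ufeffhello' - the BOM survives as U+FEFF
print(data.decode('utf-8-sig'))  # 'hello'       - the BOM is stripped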


UnicodeDecodeError for Reading files in Python

pythonNotes = open('E:\\Python Notes.docx','r')
read_it_now = pythonNotes.read()
print(read_it_now.encode('utf-16'))
When I try this code, I get:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 591: character maps to <undefined>
I am running this in visual studio with python tools - starting without debugging.
I have tried putting enc='utf-8' at the top and throwing it in as a parameter; I've looked at other questions and just couldn't find a solution to this simple issue.
Please assist.
This error can occur when text that is already in utf-8 format is read in as an 8-bit encoding, and Python tries to "decode" it to Unicode: bytes that have no meaning in the supposed encoding throw a UnicodeDecodeError. But you'll always get an error if you try to read a file as utf-8 when it is not in the utf-8 encoding.
In your case, the problem is that a docx file is not a regular text file; no single text encoding can meaningfully import it. See this SO answer for directions on how to read it on a low level, or use python-docx to get access to the document in a way that resembles what you see in Word.
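A minimal python-docx sketch of that second suggestion (install it with pip install python-docx; the path is the one from the question):
from docx import Document

doc = Document('E:\\Python Notes.docx')
# Join the text of all paragraphs into one string
text = '\n'.join(paragraph.text for paragraph in doc.paragraphs)
print(text)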

International characters in Python

I'm currently working on a Python script that takes a list of log files (from a search engine) and produces a file with all the queries within these, for later analysis.
Another feature of the script is that it removes the most common words, which I've also implemented, but I've faced a problem I can't seem to overcome. The removal of words works as intended, as long as the queries do not contain special characters. As the search logs are in Danish, the characters æ, ø and å appear regularly.
Searching on the topic I'm now aware that I need to encode these into UTF-8, which I'm doing when obtaining the query:
tmp = t_query.encode("UTF-8").lower().split()
t_query is the query and I split it up to later compare each word with my list of forbidden words. If I do not use the encoding I'll get the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 1: ordinal not in range(128)
Edit: I also tried using the decode instead, but get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa7' in position 3: ordinal not in range(128)
I loop through the words like this:
for i in tmp:
    if i in words_to_filter:
        tmp.remove(i)
As said, this works perfectly for words not including special characters. I've tried printing i along with the current forbidden word and will get e.g.:
færdelsloven - færdelsloven
Where the first word is the ith element in tmp and the last is the one from the forbidden words. Obviously something has gone wrong, but I just can't manage to find a solution. I've tried many suggestions found on Google and on here, but nothing has worked so far.
Edit 2: if it makes a difference, I've tried loading the log files both with and without the use of codecs:
with codecs.open(file_name, "r", "utf-8") as f_src:
    jlogs = map(json.loads, f_src.readlines())
I'm running Python 2.7.2 from a Windows environment, if it matters. The script should be able to run on other platforms (namely Linux and Mac OS).
I would really appreciate if one of you are able to help me out.
Best regards
Casper
If you are reading files, you want to decode them.
tmp = t_query.decode("UTF-8").lower().split()
Given a utf-8 file with a json object per line, you could read all the objects:
with open(filename) as file:
    jlogs = [json.loads(line) for line in file]
Except for the treatment of embedded newlines, the above code should produce the same result as yours:
with codecs.open(file_name, "r", "utf-8") as f_src:
    jlogs = map(json.loads, f_src.readlines())
At this point all strings in jlogs are Unicode; you don't need to do anything extra to handle "special" characters. Just make sure you are not mixing bytes and Unicode text in your code:
to get Unicode text from bytes: some_bytes.decode(character_encoding)
to get bytes from Unicode text: some_text.encode(character_encoding)
Don't encode bytes/decode Unicode.
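A minimal Python 2 round trip illustrating those two rules (the Danish word is the one from the question):
raw = 'f\xc3\xa6rdelsloven'    # bytes as read from a UTF-8 encoded file
text = raw.decode('utf-8')     # -> u'f\xe6rdelsloven', Unicode text
back = text.encode('utf-8')    # -> the original bytes again
assert back == raw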
If the encoding is right and you just want to ignore unexpected characters, you could pass the errors='ignore' or errors='replace' parameter to the codecs.open function.
with codecs.open(file_name, encoding='utf-8', mode='r', errors='ignore') as f:
    jlogs = map(json.loads, f.readlines())
Details in docs:
http://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data
I've finally solved it. As Lattyware suggested, Python 3.x handles this much better. After changing the version and saving the Python source file with a Unicode (UTF-8) encoding, it works as intended.
