I'm trying to read a Stata .dta file with the Python pandas package, using the read_stata() function, and the .dta file has many Chinese characters in it. The file reads in as mojibake: the Chinese characters all come out as gibberish. Any suggestions?
You'll need to specify a codec to use; the default is to decode your text as ISO-8859-1 (Latin-1):
pandas.read_stata(filename, encoding=codec_to_use)
See the pandas.read_stata() documentation:
encoding: string, None or encoding
Encoding used to parse the files. Note that Stata doesn’t support unicode. None defaults to iso-8859-1.
For Chinese, I'd guess that the codec used is either a gb* codec (gb18030, gbk, gb2312) or a UTF codec (UTF-8, UTF-16, or UTF-32). In spite of the remark in the pandas documentation above, I see that Stata 14 supports Unicode now, and that it uses UTF-8 for that.
Also see the Standard Encodings page for an overview of supported codecs.
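For example, a minimal sketch of the call, assuming an older pandas version where read_stata() still accepts the encoding argument, and a hypothetical file name data.dta:

import pandas as pd

# gb18030 is a superset of gbk and gb2312, so it is a reasonable first guess
# for Chinese text produced on a Chinese-locale Windows machine.
df = pd.read_stata("data.dta", encoding="gb18030")
print(df.head())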
Related
I am consuming a text response from a third party API. This text is in an encoding which is unknown to me. I consume the text in python3 and want to change the encoding into UTF-8.
This is an example of the contents I get:
Danke
"Träume groß"
🙌ðŸ¼
Super Idee!!!
I was able to get the messed up characters readable by doing the following manually:
Open new document in Notepad++
Via the Encoding menu switch the encoding of the document to ANSI
Paste the contents
Again use the Encoding menu, this time switch to UTF-8
Now the text is properly legible like below
Correct content:
Danke
"Träume groß"
🙌🏼
Super Idee!!!
I want to repeat this process in python3, but struggle to do so. From the notepad workflow I gather that the encoding shouldn't be converted, rather the existing characters should be interpreted with a different encoding. That's because if I select Convert to UTF-8 in the Encoding menu, it doesn't work.
From what I have read on SO, there are the encode and decode methods to do that. Also, ANSI isn't really an encoding but rather refers to the standard encoding the current machine uses; this would most likely be cp1252 on my Windows machine. I have messed around with all combinations of cp1252 and utf-8 as source and/or target, but to no avail: I always end up with a UnicodeEncodeError.
I have also tried using the chardet module to determine the encoding of my input string, but it requires bytes as input and b'🙌ðŸ¼' is rejected with SyntaxError: bytes can only contain ASCII literal characters.
"Träume groß" is a hint that you got something originally encoded as utf-8, but your process read it as cp1252.
A possible way is to encode your string back to cp1252 and then correctly decode it as utf-8:
print('"Träume groß"'.encode('cp1252').decode('utf8'))
gives as expected:
"Träume groß"
But this is only a workaround. The correct solution is to understand where you have read the original bytes as cp1252 and directly use the utf8 conversion there.
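A minimal sketch of what that fix might look like, assuming the text arrives over HTTP via the requests library (the URL and variable names are placeholders):

import requests

resp = requests.get("https://api.example.com/comments")
# If the server omits a charset, resp.text may fall back to ISO-8859-1 and
# produce exactly the mojibake shown above, so decode the raw bytes yourself.
text = resp.content.decode("utf-8")
print(text)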
You can use bytes() to convert a string to bytes, and then decode it with .decode()
>>> bytes("Träume groß", "cp1252").decode("utf-8")
'Träume groß'
chardet could probably be useful here -
Quoting straight from the docs
import urllib.request
rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
import chardet
chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}
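Applied to the question's garbled string, you would first recover the original bytes (a sketch, assuming the cp1252 misreading described in the other answer):

import chardet

raw = '"Träume groß"'.encode('cp1252')  # get back the bytes the API actually sent
print(chardet.detect(raw))              # should report utf-8 with high confidence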
I have a very simple piece of code that's converting a CSV... also note that I reference Notepad++ a few times, but my standard IDE is VS Code.
with codecs.open(filePath, "r", encoding = "UTF-8") as sourcefile:
lines = sourcefile.read()
with codecs.open(filePath, 'w', encoding = 'cp1252') as targetfile:
targetfile.write(lines)
Now, the job I'm doing requires a specific file to be encoded as windows-1252, and from what I understand cp1252 = windows-1252. This conversion works fine when I do it using the UI features in Notepad++, but when I try using Python codecs to encode this file it fails:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 561488: character maps to <undefined>
When I saw this failure I was confused, so I double-checked the output from when I manually convert the file using Notepad++, and the converted file is encoded in windows-1252... so what gives? Why is a UI feature in Notepad++ able to do the job when codecs seemingly can't? Does Notepad++ just ignore errors?
Looks like your input text has the character "�" (the actual "replacement character", not some other undefined character), which cannot be mapped to cp1252 (because cp1252 has no such character).
Depending on what you need, you can:
Filter it out (or replace it, or otherwise handle it) in Python before writing out lines to the output file.
Pass errors=... to the second codecs.open, choosing one of the other error-handling modes; the default is 'strict', but you can also use 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' or 'namereplace' (see the sketch after this list).
Check the input file and see why it's got the "�" character; is it corrupted?
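For the second option, a minimal sketch that keeps the codecs.open calls from the question (filePath is the same variable as above):

import codecs

with codecs.open(filePath, "r", encoding="utf-8") as sourcefile:
    lines = sourcefile.read()

# errors="replace" substitutes "?" for every character cp1252 cannot hold,
# instead of raising UnicodeEncodeError; errors="ignore" would drop them.
with codecs.open(filePath, "w", encoding="cp1252", errors="replace") as targetfile:
    targetfile.write(lines)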
Probably Python is simply more explicit in its error handling. If Notepad++ managed to represent every character correctly in CP-1252 then there is a bug in the Python codec where it should not fail where it currently does; but I'm guessing Notepad++ is silently replacing some characters with some other characters, and falsely claiming success.
Maybe try converting the result back to UTF-8 and compare the files byte by byte if the data is not easy to inspect manually.
Unicode U+FFFD is a reserved character which serves as a placeholder for a character which cannot be represented in Unicode; often, it's an indication of a conversion problem previously, when presumably this data was imperfectly input or converted at an earlier point in time.
(And yes, Windows-1252 is another name for Windows code page 1252.)
Why Notepad++ "succeeds"
Notepad++ does not offer to convert your file to cp1252, but to reinterpret it using that encoding. What led to your confusion is that it actually uses the wrong term for this. This is the Encoding menu in the program:
When "Encode with cp1252" is selected, Notepad decodes the file using cp1252 and shows you the result. If you save the character '\ufffd' to a file using utf8:
with open('f.txt', 'w', encoding='utf8') as f:
    f.write('\ufffd')
and use "Encode with cp1252" you'd see three characters:
That means that Notepad++ does not read the character in utf8 and then writes it in cp1252, because then you'd see exactly one character. You could achieve similar results to Notepad++ by reading the file using cp1252:
with open('f.txt', 'r', encoding='cp1252') as f:
    print(f.read())  # Prints �
Notepad++ lets you actually convert to only five encodings, as you can see in the screenshot above.
What should you do
This character does not exist in the cp1252 encoding, which means you can't convert this file without losing information. Common solutions are to skip such characters or replace them with similar characters that do exist in your target encoding (see encoding error handlers).
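A minimal sketch of the replacement approach, assuming the file content has already been read into a string named lines as in the question (Python 3's built-in open; codecs.open works the same way):

# Substitute something cp1252 can represent, or replace with '' to skip it entirely.
cleaned = lines.replace('\ufffd', '?')

with open(filePath, 'w', encoding='cp1252') as targetfile:
    targetfile.write(cleaned)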
You are dealing with the "utf-8-sig" encoding -- please specify this one as the encoding argument instead of "utf-8".
There is information on it in the docs (search the page for "utf-8-sig").
To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. [...]
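A minimal sketch of reading with it (Python 3 open; the file name is only an example):

# "utf-8-sig" strips the BOM bytes 0xEF 0xBB 0xBF if present and otherwise
# behaves exactly like plain "utf-8".
with open('data.csv', encoding='utf-8-sig') as f:
    text = f.read()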
I am using Python 2.7, and to get around UTF-8 issues I am using pandas' to_csv method. The problem is that I am still getting Unicode errors, which I don't get when I run the script on my local laptop with Python 3 (not an option for batch processing).
df = pd.DataFrame(stats_results)
df.to_csv('/home/mp9293q/python_scripts/stats_dim_registration_set_column_transpose.csv',
          quoting=csv.QUOTE_ALL, doublequote=True, index=False,
          index_label=False, header=False, line_terminator='\n', encoding='utf-8')
Gives error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc7' in position 4: ordinal not in range(128)
I believe you might be having one of these two problems (or maybe both):
1. As you mentioned in the comments, the file in which you are trying to save the Unicode data already exists. It is quite likely that this destination file does not have UTF-8/16/32 as its encoding scheme: when the file was originally created, its encoding may not have been UTF-8; it could well have been ANSI. So check whether the destination file's encoding is from the UTF family or not.
2. Encode the Unicode string to UTF-8 before storing it in the file. By this I mean that any content you are trying to save to your destination file, if it contains Unicode text, should be encoded first.
Ex.
# A character which cannot be encoded in plain ASCII
Uni_str = u"Ç"
# Convert the unicode text into UTF-8 bytes
Uni_str = Uni_str.encode("utf-8")
The above code works differently in Python 2.x and 3.x, the reason being that 2.x uses ASCII as its default encoding scheme while 3.x uses UTF-8. Another difference between the two is how they treat a string after passing it through encode().
In Python 2.x
type(u"Ç".encode("utf-8"))
Outputs
<type 'str'>
In Python 3.x
type(u"Ç".encode("utf-8"))
Outputs
<class 'bytes'>
As you can notice, in Python 2.x the return type of encode() is str, but in 3.x it is bytes.
So for your case, I would recommend encoding each string value containing Unicode data in your DataFrame using encode() before trying to store it in the file.
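A rough sketch of that recommendation in Python 2 (applymap and the output path here are illustrative, not taken from the original script):

# Encode every unicode cell to UTF-8 bytes so the CSV writer never falls back
# to the ascii codec.
df = df.applymap(lambda v: v.encode('utf-8') if isinstance(v, unicode) else v)
df.to_csv('output.csv', index=False, encoding='utf-8')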
For some reason, Python seems to be having issues with the BOM when reading Unicode strings from a UTF-8 file. Consider the following:
with open('test.py') as f:
    for line in f:
        print unicode(line, 'utf-8')
Seems straightforward, doesn't it?
That's what I thought until I ran it from command line and got:
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
A brief visitation to Google revealed that BOM has to be cleared manually:
import codecs

with open('test.py') as f:
    for line in f:
        print unicode(line.replace(codecs.BOM_UTF8, ''), 'utf-8')
This one runs fine. However I'm struggling to see any merit in this.
Is there a rationale behind the above-described behavior? In contrast, UTF-16 works seamlessly.
The 'utf-8-sig' encoding will consume the BOM signature on your behalf.
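In the Python 2 code from the question that would look roughly like this (a sketch using io.open, which accepts an encoding argument):

import io

with io.open('test.py', encoding='utf-8-sig') as f:
    for line in f:
        print line,  # lines are already unicode; any leading BOM has been stripped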
You wrote:
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
When you specify the "utf-8" encoding in Python, it takes you at your word. UTF-8 files aren’t supposed to contain a BOM in them. They are neither required nor recommended. Endianness makes no sense with 8-bit code units.
BOMs screw things up, too, because you can no longer just do:
$ cat a b c > abc
if those UTF-8 files have extraneous (read: any) BOMs in them. See now why BOMs are so stupid/bad/harmful in UTF-8? They actually break things.
A BOM is metadata, not data, and the UTF-8 encoding spec makes no allowance for them the way the UTF-16 and UTF-32 specs do. So Python took you at your word and followed the spec. Hard to blame it for that.
If you are trying to use the BOM as a filetype magic number to specify the contents of the file, you really should not be doing that. You are supposed to use a higher-level protocol for these metadata purposes, just as you would with a MIME type.
This is just another lame Windows bug, the workaround for which is to pass the alternate encoding "utf-8-sig" to Python.
I am getting
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 104: ordinal not in range(128)
I am using IntegerProperty, StringProperty, DateTimeProperty.
That's because 0xb0 (decimal 176) is not a valid character code in ASCII (which defines only values between 0 and 127).
Check where you got that string from and use the proper encoding.
If you need further help, post the code.
You are trying to put Unicode data (probably text with accents) into an ASCII string.
You can use Python's codecs module to open a text file with UTF-8 encoding and write the Unicode data to it.
The .encode method may also help (u"õ".encode('utf-8') for example)
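A minimal sketch of both suggestions in Python 2 (the file name is just an example):

import codecs

data = u"temperature: 25\xb0C"  # \xb0 is the degree sign that triggered the error

# Option 1: let codecs encode while writing.
with codecs.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(data)

# Option 2: encode explicitly and write the bytes yourself.
with open('out.txt', 'wb') as f:
    f.write(data.encode('utf-8'))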
Python defaults to ASCII encoding - if you are dealing with chars outside of the ASCII range, you need to specify that in your code.
One way to do this is to define the encoding at the top of your code.
This snippet sets the source-file encoding declaration at the top of the file to Latin-1 (which includes 0xb0):
#!/usr/bin/python
# -*- coding: latin-1 -*-
import os, sys
...
See PEP 263 for more info on source encodings.
When I write my foreign-language "flashcard" programs, I always use Python 3.x, as its native encoding is UTF-8. Your encoding problems will generally be far less frequent.
If you're working on a program that many people will share, however, you may want to consider using encode and decode with Python 2.x, but only when storing and retrieving data elements in persistent storage. Encode your non-ASCII characters, silently manipulate the encoded byte representations of those Unicode strings in memory, and save them in that encoded form. Finally, use decode when fetching strings from persistent storage, but for end-user display only. This will eliminate the need to constantly encode and decode your strings throughout your program.
@jcoon also has a pretty standard response to this problem.