How to convert Zäune to Zäune in Python 2.7 - python

I receive a text string from a third party api with garbled character encodings.
When I print that string to the command line, the string contains words like
Zäune instead of Zäune
Gartenmöbel instead of Gartenmöbel
etc.
What can I do, to fix the incoming text string with python 2.7, so it prints properly to the command line?
Thanks

In [36]: print('Zäune'.decode('utf-8').encode('cp1252').decode('utf-8').encode('latin-1'))
Zäune
In [37]: print('Gartenmöbel'.decode('utf-8').encode('cp1252').decode('utf-8').encode('latin-1'))
Gartenmöbel
I found this chain of encodings guess_chain_encodings.py which performs a brute-force search:
In [51]: 'Zäune'
Out[51]: 'Z\xc3\x83\xc6\x92\xc3\x82\xc2\xa4une'
In [52]: 'Zäune'
Out[52]: 'Z\xc3\xa4une'
Running
guess_chain_encodings.py "'Z\xc3\x83\xc6\x92\xc3\x82\xc2\xa4une'" "'Z\xc3\xa4une'"
yielded
'Z\xc3\x83\xc6\x92\xc3\x82\xc2\xa4une'.decode('utf_8').encode('cp1254').decode('utf_8_sig').encode('palmos')
A little playing around suggested that cp1254 could be replaced by the (more common?) cp1252, and utf_8_sig could be replaced by utf-8, and the odd palmos could be replaced by latin-1.

The strings seem to be UTF-8 encoded twice.

Notice also the console encoding - sometimes you can see your printed strings fine in the app, but it could fail to print in the console. Here's very good guide about Unicode in Python and its using techniques.

Related

Python UTF-8 REGEX

I have a problem while trying to find text specified in regex.
Everything work perfectly fine but when i added "\£" to my regex it started causing problems. I get SyntaxError. "NON ASCII CHACTER "\xc2" in file (...) but no encoding declared...
I've tried to solve this problem with using
import sys
reload(sys) # to enable `setdefaultencoding` again
sys.setdefaultencoding("UTF-8")
but it doesnt help. I just want to build regular expression and use pound sign there. flag re.Unicode flag doesnt help, saving string as unicode (pat) doesnt help. Is there any solution to fix this regex? I just want to build regular expression and use pound sign there.Thanks for help.
k = text.encode('utf-8')
pat = u'salar.{1,6}?([0-9\-,\. \tkFFRroOMmTtAanNuUMm\$\&\;\£]{2,})'
pattern = re.compile(pat, flags = re.DOTALL|re.I|re.UNICODE)
salary = pattern.search(k).group(1)
print (salary)
Error is still there even if I comment(put "#" and skip all of those lines. Maybe its not connected with re. library but my settings?
The error message means Python cannot guess which character set you are using. It also tells you that you can fix it by telling it the encoding of your script.
# coding: utf-8
string = "£"
or equivalently
string = u"\u00a3"
Without an encoding declaration, Python sees a bunch of bytes which mean different things in different encodings. Rather than guess, it forces you to tell you what they mean. This is codified in PEP-263.
(ASCII is unambiguous [except if your system is EBCDIC I guess] so it knows what you mean if you use a pure-ASCII representation for everything.)
The encoding settings you were fiddling with affect how files and streams are read, and program I/O generally, but not how the program source is interpreted.

Python - non-English characters don't work in one case

Despite the fact I tried to find a solution to my problem both on english and my native-language sites I was unable to find a solution.
I'm querying an online dictionary to get translated words, however non-English characters are displayed as e.g. x86 or x84. However, if I just do print(the_same_non-english_character) the letter is displayed in a proper form. I use Python 3.3.2 and the HTML source of the site I extract the words from has charset=UTF-8 set.
Morever, if I use e.g. replace("x86", "non-english_character"), I don't get anything replaced, but replacing of normal characters works.
you need to escape with a \:
In [1]: s= "\x86"
In [2]: s.replace("\x86","non-english_character")
Out[2]: 'non-english_character'

Terminate multi-line string in Python console

In PyCharm, if I open a Python Console, I can't terminate a multi-line string.
Here's what happens in IDLE for comparison:
>>> words = '''one
two
three'''
>>> print(words)
one
two
three
>>>
But if I try the same thing in an interactive Python Console from within PyCharm, the console expects more input after I type the final 3 apostrophes. Anyone know why?
>>> words = '''one
... two
... three'''
...
I'm not sure what the context is, but in many cases it would just be easier to make a tuple/list from the things you want printed on different lines and join them with "\n":
>>> words = "\n".join(["one", "two", "three"])
You may also try three double-quote symbols instead. Maybe PyCharm is confused by what's being delimited. I've always wondered this in Python because strings can be concatenated just by pure juxtaposition. So effectively, '' 'one\n\two\nthree' '' ought to take the three different strings, (1) '' (2) 'one\n\two\nthree' and (3) '', and concatenate them. Since the spaces between them ought not be needed (principle of least astonishment), it's more intuitive to me that the triple-single-(or double)-quote would be interpreted that way. But since the triple version is it's own special character, it doesn't work like that.
In IPython the syntax you give works with no problem. IPython also provides a nice magic command %cpaste in which you can paste multi-line expressions or statements, and then delimit the final line with --, and upon hitting enter, it executes the pasted block. I prefer IPython (running in a buffer in Emacs) to PyCharm by a lot, but maybe you can see if there's a comparable magic function, or just look up the source for that magic function and write one yourself?

splitting unicode string into words

I am trying to split a Unicode string into words (simplistic), like this:
print re.findall(r'(?u)\w+', "раз два три")
What I expect to see is:
['раз','два','три']
But what I really get is:
['\xd1', '\xd0', '\xd0', '\xd0', '\xd0\xb2\xd0', '\xd1', '\xd1', '\xd0']
What am I doing wrong?
Edit:
If I use u in front of the string:
print re.findall(r'(?u)\w+', u"раз два три")
I get:
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Edit 2:
Aaaaand it seems like I should have read docs first:
print re.findall(r'(?u)\w+', u"раз два три")[0].encode('utf-8')
Will give me:
раз
Just to make sure though, does that sound like a proper way of approaching it?
You're actually getting the stuff you expect in the unicode case. You only think you are not because of the weird escaping due to the fact that you're looking at the reprs of the strings, not not printing their unescaped values. (This is just how lists are displayed.)
>>> words = [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
>>> for w in words:
... print w # This uses the terminal encoding -- _only_ utilize interactively
...
раз
два
три
>>> u'раз' == u'\u0440\u0430\u0437'
True
Don't miss my remark about printing these unicode strings. Normally if you were going to send them to screen, a file, over the wire, etc. you need to manually encode them into the correct encoding. When you use print, Python tries to leverage your terminal's encoding, but it can only do that if there is a terminal. Because you don't generally know if there is one, you should only rely on this in the interactive interpreter, and always encode to the right encoding explicitly otherwise.
In this simple splitting-on-whitespace approach, you might not want to use regex at all but simply to use the unicode.split method.
>>> u"раз два три".split()
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Your top (bytestring) example does not work because re basically assumes all bytestrings are ASCII for its semantics, but yours was not. Using unicode strings allows you to get the right semantics for your alphabet and locale. As much as possible, textual data should always be represented using unicode rather than str.

Working with unicode encoded Strings from Active Directory via python-ldap

I already came up with this problem, but after some testing I decided to create a new question with some more specific Infos:
I am reading user accounts with python-ldap (and Python 2.7) from our Active Directory. This does work well, but I have problems with special chars. They do look like UTF-8 encoded strings when printed on the console. The goal is to write them into a MySQL DB, but I don't get those strings into proper UTF-8 from the beginning.
Example (fullentries is my array with all the AD entries):
fullentries[23][1].decode('utf-8', 'ignore')
print fullentries[23][1].encode('utf-8', 'ignore')
print fullentries[23][1].encode('latin1', 'ignore')
print repr(fullentries[23][1])
A second test with a string inserted by hand as follows:
testentry = "M\xc3\xbcller"
testentry.decode('utf-8', 'ignore')
print testentry.encode('utf-8', 'ignore')
print testentry.encode('latin1', 'ignore')
print repr(testentry)
The output of the first example ist:
M\xc3\xbcller
M\xc3\xbcller
u'M\\xc3\\xbcller'
Edit: If I try to replace the double backslashes with .replace('\\\\','\\) the output remains the same.
The output of the second example:
Müller
M�ller
'M\xc3\xbcller'
Is there any way to get the AD output properly encoded? I already read a lot of documentation, but it all states that LDAPv3 gives you strictly UTF-8 encoded strings. Active Directory uses LDAPv3.
My older question this topic is here: Writing UTF-8 String to MySQL with Python
Edit: Added repr(s) infos
First, know that printing to a Windows console is often the step that garbles data, so for your tests, you should print repr(s) to see the precise bytes you have in your string.
You need to find out how the data from AD is encoded. Again, print repr(s) will let you see the content of the data.
UPDATED:
OK, it looks like you're getting strange strings somehow. There might be a way to get them better, but you can adapt in any case, though it isn't pretty:
u.decode('unicode_escape').encode('iso8859-1').decode('utf8')
You might want to look into whether you can get the data in a more natural format.

Categories

Resources