ZWNJ not shown properly in python 3.3 - python

I am trying to replace the space between two tokens written in the Arabic alphabet with a ZWNJ but what the function returns is not decoded properly on the screen:
>>> nm.normalize("رشته ها")
'رشته\u200cها'
\u200 should be rendered as a half-space that would be placed between 'رشته' and 'ها' here, but it gets messed up like that. I am using Python 3.3.3

The function returned a string object with the \u200c character as part of it, but Python shows you the representation. The \uxxxx syntax is used to make the representation useful as a debugging value, you can now copy that representation and paste it back into Python and get the exact same value.
In other words, the function worked exactly as advertised; the space was indeed replaced by a U+200C ZERO WIDTH NON-JOINER codepoint.
If you wanted to write the string to your terminal or console, use print():
print(nm.normalize("رشته ها"))
Demo:
>>> result = 'رشته\u200cها'
>>> len(result)
7
>>> result[4]
'\u200c'
>>> print(result)
رشته‌ها
You can see that character 5 (index 4) is a single character here, not 6 separate characters.

Related

Python 3.6 literal strings

I'm having a hard time finding what can be put inside literal strings.
For example, I've seen this code on the PEP above, yet I didn't find any information above about what it does.
>>> value = 1234
>>> f'input={value:#06x}'
'input=0x04d2'
Is there a tutorial for understanding string literals better?
What's new here is that you can just literally write value inside the f-string and Python will insert it.
The #06x part is nothing new and just a way to format numbers in hexadecimal representation. Python2:
>>> value = 1234
>>> '{:#06x}'.format(value)
'0x04d2'
# says to prefix the output (here with 0x).
06 says pad left with zeroes such that the output has at least length 6.
x is the hex format specifier.
You can read all about it here.

Do not transform special characters in Python

I just want to print the original string.
[Case1] I know put "r" before the string can work
print r'123\r\n567"45'
>>`
123\r\n567"45
[Case2] But when it is a Variable
aaa = '123\r\n567"45'
print aaa
>>
123
567"45
Is there any function can print aaa with the same effect like Case1?
The obvious way to make Case 2 work like Case 1 is to use a raw string in your assignment statement:
aaa = r'123\r\n567"45'
Now when you print aaa, you'll get the actual backslashes and r and n characters, rather than a carriage return and a newline.
If you're actually loading aaa from some other source (rather than using a string literal), your task is a little bit more complicated. You'll actually need to transform the string in some way to get the output you want.
One simple way of doing something close to what you want is to use the repr function:
aaa = some_function() # returns '123\r\n567"45' and some_function can't be changed
print repr(aaa)
This will not quite do what you want though, since it will add quotation marks around the string's text. If you care about that, you could remove them with a slice:
print repr(aaa)[1:-1]
Another approach to take is to manually transform the characters you want escaped, e.g. with str.replace or str.translate. This is easy to do if you only care about escaping a few special characters and not others.
print aaa.replace('\r', r'\r').replace('\n', r'\n')
A final option is to use str.encode with the special character set called unicode-escape, which will escape all characters that are not printable ASCII:
print aaa.encode('unicode-escape')
This only works as intended in Python 2 however. In Python 3, str.encode always returns a bytes instance which you'd need to decode again to get a str (aaa.encode('unicode-escape').decode('ascii') would work, but it's really ugly).
You can do it using repr function in python:
If you are using python 2 then :
If you are using python 3 then :
If what you really want is just to print the original string, instead of prepending an r before the literal, you may want to use the python native function repr. E.g.
>>> aaa = '123\r\n567"45'
>>> print repr(aaa)
'123\r\n567"45'
which is equivalent in this (exact) case to
>>> print repr('123\r\n567"45')
'123\r\n567"45'

ElementTree will not parse special characters with Python 2.7

I had to rewrite my python script from python 3 to python2 and after that I got problem parsing special characters with ElementTree.
This is a piece of my xml:
<account number="89890000" type="Kostnad" taxCode="597" vatCode="">Avsättning egenavgifter</account>
This is the ouput when I parse this row:
('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avs\xc3\xa4ttning egenavgifter')
So it seems to be a problem with the character "ä".
This is how i do it in the code:
sys.setdefaultencoding( "UTF-8" )
xmltree = ET()
xmltree.parse("xxxx.xml")
printAccountPlan(xmltree)
def printAccountPlan(xmltree):
print("account:",str(i.attrib['number']), "AccountType:",str(i.attrib['type']),"Name:",str(i.text))
Anyone have an ide to get the ElementTree parse the charracter "ä", so the result will be like this:
('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avsättning egenavgifter')
You're running into two separate differences between Python 2 and Python 3 at the same time, which is why you're getting unexpected results.
The first difference is one you're probably already aware of: Python's print statement in version 2 became a print function in version 3. That change is creating a special circumstance in your case, which I'll get to a little later. But briefly, this is the difference in how 'print' works:
In Python 3:
>>> # Two arguments 'Hi' and 'there' get passed to the function 'print'.
>>> # They are concatenated with a space separator and printed.
>>> print('Hi', 'there')
>>> Hi there
In Python 2:
>>> # 'print' is a statement which doesn't need parenthesis.
>>> # The parenthesis instead create a tuple containing two elements
>>> # 'Hi' and 'there'. This tuple is then printed.
>>> print('Hi', 'there')
>>> ('Hi', 'there')
The second problem in your case is that tuples print themselves by calling repr() on each of their elements. In Python 3, repr() displays unicode as you want. But in Python 2, repr() uses escape characters for any byte values which fall outside the printable ASCII range (e.g., larger than 127). This is why you're seeing them.
You may decide to resolve this issue, or not, depending on what you're goal is with your code. The representation of a tuple in Python 2 uses escape characters because it's not designed to be displayed to an end-user. It's more for your internal convenience as a developer, for troubleshooting and similar tasks. If you're simply printing it for yourself, then you may not need to change a thing because Python is showing you that the encoded bytes for that non-ASCII character are correctly there in your string. If you do want to display something to the end-user which has the format of how tuples look, then one way to do it (which retains correct printing of unicode) is to manually create the formatting, like this:
def printAccountPlan(xmltree):
data = (i.attrib['number'], i.attrib['type'], i.text)
print "('account:', '%s', 'AccountType:', '%s', 'Name:', '%s')" % data
# Produces this:
# ('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avsättning egenavgifter')

Is it possible to suppress Python's escape sequence processing on a given string without using the raw specifier?

Conclusion: It's impossible to override or disable Python's built-in escape sequence processing, such that, you can skip using the raw prefix specifier. I dug into Python's internals to figure this out. So if anyone tries designing objects that work on complex strings (like regex) as part of some kind of framework, make sure to specify in the docstrings that string arguments to the object's __init__() MUST include the r prefix!
Original question: I am finding it a bit difficult to force Python to not "change" anything about a user-inputted string, which may contain among other things, regex or escaped hexadecimal sequences. I've already tried various combinations of raw strings, .encode('string-escape') (and its decode counterpart), but I can't find the right approach.
Given an escaped, hexadecimal representation of the Documentation IPv6 address 2001:0db8:85a3:0000:0000:8a2e:0370:7334, using .encode(), this small script (called x.py):
#!/usr/bin/env python
class foo(object):
__slots__ = ("_bar",)
def __init__(self, input):
if input is not None:
self._bar = input.encode('string-escape')
else:
self._bar = "qux?"
def _get_bar(self): return self._bar
bar = property(_get_bar)
#
x = foo("\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
print x.bar
Will yield the following output when executed:
$ ./x.py
\x01\r\xb8\x85\xa3\x00\x00\x00\x00\x8a.\x03ps4
Note the \x20 got converted to an ASCII space character, along with a few others. This is basically correct due to Python processing the escaped hex sequences and converting them to their printable ASCII values.
This can be solved if the initializer to foo() was treated as a raw string (and the .encode() call removed), like this:
x = foo(r"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
However, my end goal is to create a kind of framework that can be used and I want to hide these kinds of "implementation details" from the end user. If they called foo() with the above IPv6 address in escaped hexadecimal form (without the raw specifier) and immediately print it back out, they should get back exactly what they put in w/o knowing or using the raw specifier. So I need to find a way to have foo's __init__() do whatever processing is necessary to enable that.
Edit: Per this SO question, it seems it's a defect of Python, in that it always performs some kind of escape sequence processing. There does not appear to be any kind of facility to completely turn off escape sequence processing, even temporarily. Sucks. I guess I am going to have to research subclassing str to create something like rawstr that intelligently determines what escape sequences Python processed in a string, and convert them back to their original format. This is not going to be fun...
Edit2: Another example, given the sample regex below:
"^.{0}\xcb\x00\x71[\x00-\xff]"
If I assign this to a var or pass it to a function without using the raw specifier, the \x71 gets converted to the letter q. Even if I add .encode('string-escape') or .replace('\\', '\\\\'), the escape sequences are still processed. thus resulting in this output:
"^.{0}\xcb\x00q[\x00-\xff]"
How can I stop this, again, without using the raw specifier? Is there some way to "turn off" the escape sequence processing or "revert" it after the fact thus that the q turns back into \x71? Is there a way to process the string and escape the backslashes before the escape sequence processing happens?
I think you have an understandable confusion about a difference between Python string literals (source code representation), Python string objects in memory, and how that objects can be printed (in what format they can be represented in the output).
If you read some bytes from a file into a bytestring you can write them back as is.
r"" exists only in source code there is no such thing at runtime i.e., r"\x" and "\\x" are equal, they may even be the exact same string object in memory.
To see that input is not corrupted, you could print each byte as an integer:
print " ".join(map(ord, raw_input("input something")))
Or just echo as is (there could be a difference but it is unrelated to your "string-escape" issue):
print raw_input("input something")
Identity function:
def identity(obj):
return obj
If you do nothing to the string then your users will receive the exact same object back. You can provide examples in the docs what you consider a concise readable way to represent input string as Python literals. If you find confusing to work with binary strings such as "\x20\x01" then you could accept ascii hex-representation instead: "2001" (you could use binascii.hexlify/unhexlify to convert one to another).
The regex case is more complex because there are two languages:
Escapes sequences are interpreted by Python according to its string literal syntax
Regex engine interprets the string object as a regex pattern that also has its own escape sequences
I think you will have to go the join route.
Here's an example:
>>> m = {chr(c): '\\x{0}'.format(hex(c)[2:].zfill(2)) for c in xrange(0,256)}
>>>
>>> x = "\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
>>> print ''.join(map(m.get, x))
\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34
I'm not entirely sure why you need that though. If your code needs to interact with other pieces of code, I'd suggest that you agree on a defined format, and stick to it.

splitting unicode string into words

I am trying to split a Unicode string into words (simplistic), like this:
print re.findall(r'(?u)\w+', "раз два три")
What I expect to see is:
['раз','два','три']
But what I really get is:
['\xd1', '\xd0', '\xd0', '\xd0', '\xd0\xb2\xd0', '\xd1', '\xd1', '\xd0']
What am I doing wrong?
Edit:
If I use u in front of the string:
print re.findall(r'(?u)\w+', u"раз два три")
I get:
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Edit 2:
Aaaaand it seems like I should have read docs first:
print re.findall(r'(?u)\w+', u"раз два три")[0].encode('utf-8')
Will give me:
раз
Just to make sure though, does that sound like a proper way of approaching it?
You're actually getting the stuff you expect in the unicode case. You only think you are not because of the weird escaping due to the fact that you're looking at the reprs of the strings, not not printing their unescaped values. (This is just how lists are displayed.)
>>> words = [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
>>> for w in words:
... print w # This uses the terminal encoding -- _only_ utilize interactively
...
раз
два
три
>>> u'раз' == u'\u0440\u0430\u0437'
True
Don't miss my remark about printing these unicode strings. Normally if you were going to send them to screen, a file, over the wire, etc. you need to manually encode them into the correct encoding. When you use print, Python tries to leverage your terminal's encoding, but it can only do that if there is a terminal. Because you don't generally know if there is one, you should only rely on this in the interactive interpreter, and always encode to the right encoding explicitly otherwise.
In this simple splitting-on-whitespace approach, you might not want to use regex at all but simply to use the unicode.split method.
>>> u"раз два три".split()
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Your top (bytestring) example does not work because re basically assumes all bytestrings are ASCII for its semantics, but yours was not. Using unicode strings allows you to get the right semantics for your alphabet and locale. As much as possible, textual data should always be represented using unicode rather than str.

Categories

Resources