Unicode character not in range when calling locale.strxfrm - python

I am experiencing odd behavior when using the locale library with Unicode input. Below is a minimal working example:
>>> x = '\U0010fefd'
>>> ord(x)
1113853
>>> ord('\U0010fefd') == 0X10fefd
True
>>> ord(x) <= 0X10ffff
True
>>> import locale
>>> locale.strxfrm(x)
'\U0010fefd'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: character U+110000 is not in range [U+0000; U+10ffff]
I have seen this on Python 3.3, 3.4 and 3.5. I do not get an error on Python 2.7.
As far as I can see, my input is within the appropriate Unicode range, so it seems that something internal to strxfrm under the 'en_US.UTF-8' locale is moving the input out of range.
I am running Mac OS X, and this behavior may be related to http://bugs.python.org/issue23195... but I was under the impression that bug would only manifest as incorrect results, not a raised exception. I cannot replicate it on my SLES 11 machine, and others confirm they cannot replicate it on Ubuntu, CentOS, or Windows. It may be instructive to hear about other OSes in the comments.
Can someone explain what may be happening here under the hood?

In Python 3.x, the function locale.strxfrm(s) internally uses the POSIX C function wcsxfrm(), which is based on the current LC_COLLATE setting. The POSIX standard defines the transformation this way:
The transformation shall be such that if wcscmp() is applied to two
transformed wide strings, it shall return a value greater than, equal
to, or less than 0, corresponding to the result of wcscoll() applied
to the same two original wide-character strings.
This definition can be implemented in multiple ways, and doesn't even require that the resulting string be readable.
I've created a little C code example to demonstrate how it works:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
    wchar_t buf[10];
    wchar_t *in = L"\x10fefd";
    int i;

    setlocale(LC_COLLATE, "en_US.UTF-8");

    printf("in : ");
    for (i = 0; i < 10 && in[i]; i++)
        printf(" 0x%x", in[i]);
    printf("\n");

    i = wcsxfrm(buf, in, 10);

    printf("out: ");
    for (i = 0; i < 10 && buf[i]; i++)
        printf(" 0x%x", buf[i]);
    printf("\n");
}
It prints the string before and after the transformation.
Running it on Linux (Debian Jessie) this is the result:
in : 0x10fefd
out: 0x1 0x1 0x1 0x1 0x552
while running it on OSX (10.11.1) the result is:
in : 0x10fefd
out: 0x103 0x1 0x110000
You can see that the output of wcsxfrm() on OSX contains the character U+110000 which is not permitted in a Python string, so this is the source of the error.
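The range limit is easy to verify from Python 3 itself; CPython refuses to construct such a character:
>>> chr(0x110000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)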
On Python 2.7 the error is not raised because its locale.strxfrm() implementation is based on the strxfrm() C function, which operates on byte strings rather than wide characters.
UPDATE:
Investigating further, I see that the LC_COLLATE definition for en_US.UTF-8 on OSX is a symlink to the la_LN.US-ASCII definition.
$ ls -l /usr/share/locale/en_US.UTF-8/LC_COLLATE
lrwxr-xr-x 1 root wheel 28 Oct 1 14:24 /usr/share/locale/en_US.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
I found the actual definition in the sources from Apple. The content of file la_LN.US-ASCII.src is the following:
order \
\x00;...;\xff
2nd UPDATE:
I've further tested the wcsxfrm() function on OSX. Using the la_LN.US-ASCII collate, given a sequence of wide characters C1..Cn as input, the output is a string of this form:
W1..Wn \x01 U1..Un
where
Wx = 0x103 if Cx > 0xFF else Cx+0x3
Ux = Cx+0x103 if Cx > 0xFF else Cx+0x3
Using this algorithm, \x10fefd becomes 0x103 0x1 0x110000.
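A small Python model of that observed behavior (an illustration of the rules above, not Apple's actual implementation):
def osx_wcsxfrm_model(chars):
    # Output has the form W1..Wn 0x01 U1..Un, per the rules above.
    w = [0x103 if c > 0xFF else c + 0x3 for c in chars]
    u = [c + 0x103 if c > 0xFF else c + 0x3 for c in chars]
    return w + [0x1] + u

print([hex(c) for c in osx_wcsxfrm_model([0x10fefd])])
# ['0x103', '0x1', '0x110000']  -> the out-of-range U+110000 again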
I've checked, and every UTF-8 locale uses this collate on OSX, so I'm inclined to say that collate support for UTF-8 on Apple systems is broken. The resulting ordering is almost the same as the one obtained with plain byte comparison, with the bonus ability to produce illegal Unicode characters.

depythonifying 'char', got 'str' for pyobjc

The story: I am using a piece of hardware that can be controlled automatically through an Objective-C framework. It is already used by many colleagues, so I regard it as a "fixed" library. I would like to drive it from Python; with pyobjc I can already connect to the device, but I have failed to send data to it.
The Objective-C method in the header looks like this:
- (BOOL)executeabcCommand:(NSString*)commandabc
                 withArgs:(uint32_t)args
                 withData:(uint8_t*)data
              writeLength:(NSUInteger)writeLength
               readLength:(NSUInteger)readLength
      timeoutMilliseconds:(NSUInteger)timeoutMilliseconds
                    error:(NSError **)error;
and from my Python code, data is an argument which can contain 256 bytes of data such as 0x00, 0x01, 0xFF. My Python code looks like this:
senddata=Device.alloc().initWithCommunicationInterface_(tcpInterface)
command = 'ABCw'
args= 0x00
writelength = 0x100
readlength = 0x100
data = '\x50\x40'
timeout = 500
success, error = senddata.executeabcCommand_withArgs_withData_writeLength_readLength_timeoutMilliseconds_error_(command, args, data, writelength, readlength, timeout, None)
Whatever I send into it, it always shows:
ValueError: depythonifying 'char', got 'str'
I tried to dig in a little, but failed to find anything about converting a string or a list to char with pyobjc.
Objective-C follows the rules that apply to C.
So in Objective-C, as in C, a uint8_t* is in fact the very same thing as a char* in memory. A string differs from this only in the convention that the last character is \0, to indicate where the char* block we call a string ends. So char* blocks end with \0 because, well, it's a string.
What do we do in C to find out the length of a character block?
We iterate over the whole block until we find \0, usually with a while loop, breaking out when we find it; the counter inside the loop tells you the length if you were not given it some other way.
It is up to you to interpret the data in the desired format.
Which is why it is sometimes easier to cast from void*, or to take a char* block that is then cast to and declared as uint8_t data inside the function that makes use of it. That's the nice part of C: being able to define that as you wish. Use the force that was given to you.
So to make your life easier, you could add a length parameter like so:
-withData:(uint8_t*)data andLength:(uint64_t)len;
to avoid parsing the character stream again, since you already know it is (or should be) 256 characters long. The one thing you want to avoid at all costs in C is attempting to read at indices that are out of bounds, which throws a BAD_ACCESS exception.
But this basic information should enable you to find a way to declare your char* block (containing uint8_t data, addressed by the very first pointer, which also points to the first uint8_t character of the block) as a str with a specific length, or up to the first appearance of \0.
Sidenote:
Objective-C's @"someNSString" == Python's u"pythonstring"
PS: it is not clear from your question who throws that error message.
Python? Because it could not interpret the data when receiving it?
PyObjC? Because it is Python syntax hell when you mix it with Objective-C?
The Objective-C runtime? Because it follows the strict rules of C as well?
Python has always been very forgiving about shoe-horning one type into another, but Python 3 uses Unicode strings by default, which need to be converted into byte strings before plugging into pyobjc methods.
Try specifying the strings as bytes objects, e.g. b'this'.
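Applied to the question's snippet, that would mean passing bytes for both the command and the data (same hypothetical method and variable names as in the question):
command = b'ABCw'      # bytes literal instead of str
data = b'\x50\x40'     # raw bytes, matching the uint8_t* parameter
success, error = senddata.executeabcCommand_withArgs_withData_writeLength_readLength_timeoutMilliseconds_error_(
    command, args, data, writelength, readlength, timeout, None)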
I was hitting the same error trying to use IOKit:
import objc
from Foundation import NSBundle
IOKit = NSBundle.bundleWithIdentifier_('com.apple.framework.IOKit')
functions = [("IOServiceGetMatchingService", b"II#"), ("IOServiceMatching", b"#*"),]
objc.loadBundleFunctions(IOKit, globals(), functions)
The problem arose when I tried to call the function like so:
IOServiceMatching('AppleSmartBattery')
Receiving
Traceback (most recent call last):
File "<pyshell#53>", line 1, in <module>
IOServiceMatching('AppleSmartBattery')
ValueError: depythonifying 'charptr', got 'str'
While as a byte object I get:
IOServiceMatching(b'AppleSmartBattery')
{
IOProviderClass = AppleSmartBattery;
}

Differences in Python struct module lengths

I'm attempting to use Python's struct module to decode some binary headers from a GPS system. I have two types of header, long and short, and I have an example of reading each one below:
import struct
import binascii
packed_data_short = binascii.unhexlify('aa44132845013b078575e40c')
packed_data_long = binascii.unhexlify('aa44121ca20200603400000079783b07bea9bd0c00000000cc5dfa33')
print packed_data_short
print len(packed_data_short)
sS = struct.Struct('c c c B H H L')
unpacked_data_short = sS.unpack(packed_data_short)
print 'Unpacked Values:', unpacked_data_short
print ''
print packed_data_long
print len(packed_data_long)
sL = struct.Struct('c c c B H c b H H b c H L L H H')
unpacked_data_long = sL.unpack(packed_data_long)
print 'Unpacked Values:', unpacked_data_long
In both cases I get the length I am expecting - 12 bytes for a short header and 28 bytes for a long one. In addition all the fields appear correctly and (to the best of my knowledge with old data) are sensible values. All good so far.
I moved this across to another computer (running a different version of Python, 2.7.6 as opposed to 2.7.11) and I get different struct lengths from calcsize, plus errors when passing the length I calculated and that the other version was content with. Now the short header expects 16 bytes and the long one 36 bytes.
If I pass the larger amount it asks for, most of the records are fine until the "L" records. In the long example the first one is as expected, but the second one, which should just be 0, is not correct, and consequently the two fields after it are also incorrect. In light of the number of bytes the function wants, I noticed that it is 4 extra for each of the "L"s, and indeed just running struct.calcsize('L') I get 8 in 2.7.6 and 4 in 2.7.11. This at least narrows down where the problem is, but I don't understand why it is happening.
At present I'm updating the second computer to Python 2.7.11 (will update once I have it), but I can't find anything in the struct documentation which would suggest there has been a change to this. Is there anything I have clearly missed or is this simply a version problem?
The documentation I have been referring to is here.
EDIT: Further to comment regarding OS - one is a 64 bit version of Windows 7 (the one which works as expected), the second is a 64 bit version of Ubuntu 14.04.
This is not a bug; see the struct documentation:
Note
By default, the result of packing a given C struct includes pad bytes
in order to maintain proper alignment for the C types involved;
similarly, alignment is taken into account when unpacking. This
behavior is chosen so that the bytes of a packed struct correspond
exactly to the layout in memory of the corresponding C struct. To
handle platform-independent data formats or omit implicit pad bytes,
use standard size and alignment instead of native size and alignment:
see Byte Order, Size, and Alignment for details.
To decode the data from that GPS device, you need to use < or > in your format string as described in 7.3.2.1. Byte Order, Size, and Alignment. Since you got it working on the other machine, I presume the data is in little-endian format, and it would work portably if you used
sS = struct.Struct('<cccBHHL')
sL = struct.Struct('<cccBHcbHHbcHLLHH')
whose sizes are always
>>> sS.size
12
>>> sL.size
28
Why did they differ? The original computer you're using is either a Windows machine or a 32-bit machine, and the remote machine is a 64-bit *nix. In native sizes, L means the C compiler's unsigned long type. On 32-bit Unixes and all Windows versions, this is 32 bits wide.
On 64-bit Unixes the standard ABI on x86 is LP64, which means that long and pointers are 64 bits wide. However, Windows uses LLP64; only long long is 64-bit there. The reason is that lots of code, and even the Windows API itself, has long relied on long being exactly 32 bits.
With the < flag present, both L and I are guaranteed to be 32 bits. There was no problem with the other field specifiers because their sizes remain the same across x86 platforms and operating systems.
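The difference is easy to demonstrate (a quick sketch; the native result shown is for a 64-bit Linux, and being platform-dependent is the whole point):
import struct
# Native mode: sizes and alignment follow the platform's C ABI,
# so 'L' is 8 bytes on LP64 systems and the struct gets padded.
print struct.calcsize('cccBHHL')    # 16 on 64-bit Linux, 12 on Windows/32-bit
# Standard mode: '<' fixes the sizes (L is always 4) and removes padding.
print struct.calcsize('<cccBHHL')   # 12 everywhere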

Changing number representation in IDLE

I use Python IDLE a lot in my day-to-day job, mostly for short scripts and as a powerful and convenient calculator.
I usually have to work with different numeric bases (mostly decimal, hexadecimal, binary and less frequently octal and other bases.)
I know that using int(), hex(), bin(), oct() is a convenient way to move from one base to another, and prefixing integer literals with the right prefix is another way to express a number.
I find it quite inconvenient to have to wrap a calculation in a function just to see the result in the right base (and the resulting output of hex() and similar functions is a string), so what I'm trying to achieve is to have either a function (or maybe a statement?) that sets the internal IDLE number representation to a given base (2, 8, 10, 16).
Example :
>>> repr_hex() # from now on, all number are considered hexadecimal, in input and in output
>>> 10 # 16 in dec
>>> 0x10 # now output is also in hexadecimal
>>> 1e + 2
>>> 0x20
# override should be possible with integer literal prefixes
# 0x: hex ; 0b: bin ; 0n: dec ; 0o: oct
>>> 0b111 + 10 + 0n10 # dec : 7 + 16 + 10
>>> 0x21 # 33 dec
# still possible to override output representation temporarily with a conversion function
>>> conv(_, 10) # conv(x, output_base, current_base=internal_base)
>>> 0n33
>>> conv(_, 2) # use prefix of previous output to set current_base to 10
>>> 0b100001
>>> conv(10, 8, 16) # convert 10 to base 8 (10 is in base 16: 0x10)
>>> 0o20
>>> repr_dec() # switch to base 10, in input and in output
>>> _
>>> 0n16
>>> 10 + 10
>>> 0n20
Implementing those features doesn't seem to be difficult, what I don't know is:
Is it possible to change number representation in IDLE?
Is it possible to do this without having to change IDLE (source code) itself? I looked at IDLE extensions, but I don't know where to start to have access to IDLE internals from there.
Thank you.
IDLE does not have a number representation. It sends the code you enter to a Python interpreter and displays the string sent back in response. In this sense, it is irrelevant that IDLE is written in Python. The same is true of any IDE or REPL for Python code.
That said, the CPython sys module has a displayhook function. For 3.5:
>>> help(sys.displayhook)
Help on built-in function displayhook in module sys:
displayhook(...)
displayhook(object) -> None
Print an object to sys.stdout and also save it in builtins._
That actually should be __builtins__._, as in the example below. Note that the input is any Python object. For IDLE, the default sys.displayhook is a function defined in idlelib/rpc.py. Here is an example relevant to your question.
>>> def new_hook(ob):
        if type(ob) is int:
            ob = hex(ob)
        __builtins__._ = ob
        print(ob)
>>> sys.displayhook = new_hook
>>> 33
0x21
>>> 0x21
0x21
This gives you the more important half of what you asked for. Before actually using anything in IDLE, I would look at the default version to make sure I did not miss anything. One could write an extension to add menu entries that would switch displayhooks.
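If you try this, save whatever hook is currently active before replacing it; under IDLE the active sys.displayhook is IDLE's own (from idlelib/rpc.py) rather than sys.__displayhook__, so restoring the saved value is the safer route:
import sys
old_hook = sys.displayhook    # IDLE's own hook, kept for later
sys.displayhook = new_hook
# ... work in hex ...
sys.displayhook = old_hook    # back to the normal display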
Python intentionally does not have an input preprocessor function. GvR wants the contents of a .py file to always be python code as defined in some version of the reference manual.
I have thought about the possibility of adding an inputhook to IDLE, but I would not allow one to be active when running a .py file from the editor. If there were one added for the Shell, I would change the prompt from '>>>' to something else, such as 'hex>' or 'bin>'.
EDIT:
One could also write an extension to rewrite input code when explicitly requested either with a menu selection or a hot key or key binding. Or one could edit the current idlelib/ScriptBinding.py to make rewriting automatic. The hook I have thought about would make this easier, but not expand what can be done now.
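For the conversion half of the question, a plain function is enough and needs no IDLE support at all. A sketch (conv and its signature are the question's hypothetical names; the 0n decimal prefix is not real Python, so base 10 here just uses str):
def conv(x, output_base, current_base=10):
    # Interpret x in current_base when it arrives as a string,
    # then render it in output_base (2, 8, 10 or 16).
    n = int(x, current_base) if isinstance(x, str) else x
    return {2: bin, 8: oct, 10: str, 16: hex}[output_base](n)

print(conv('10', 8, 16))   # 0o20  (0x10 shown in base 8)
print(conv(33, 2))         # 0b100001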

Curses library in python

I have C code which draws a vertical & a horizontal line in the center of screen as below:
#include <stdio.h>
#include <conio.h>   /* clrscr() and gotoxy() are Turbo C extensions */

#define HLINE for(i=0;i<79;i++)\
                  printf("%c",196);
#define VLINE(X,Y) {\
    gotoxy(X,Y);\
    printf("%c",179);\
}

int main()
{
    int i, y;
    clrscr();
    gotoxy(1,12);
    HLINE
    for(y=1; y<25; y++)
        VLINE(39,y)
    return 0;
}
I am trying to convert it literally to Python, version 2.7.6:
import curses

def HLINE():
    for i in range(0, 79):
        print "%c" % 45

def VLINE(X, Y):
    curses.setsyx(Y, X)
    print "%c" % 124

curses.setsyx(12, 1)
HLINE()
for y in range(1, 25):
    VLINE(39, y)
My questions:
1. Do we have to swap the positions of x and y in the setsyx function, i.e. does gotoxy(1,12) become setsyx(12,1)?
2. Is the curses module only available for Unix, not for Windows? If so, what about Windows (Python 2.7.6)?
3. Why do the character values 179 and 196 come out as � in Python, when in C they are | and - respectively?
4. Is the above Python code literally right, or does it need some improvement?
Yes, you will have to change the argument positions. setsyx(y, x) and gotoxy(x, y)
There are Windows libraries made available. I find the most useful binaries here: link
This most likely has to do with character encoding. What you could try is adding the following line at the top of your Python file (after the #!/usr/bin/python line), which declares the file's source encoding as UTF-8:
# -*- coding: utf-8 -*-
Your Python code looks acceptable enough to me; I wouldn't worry about it.
Yes.
Duplicate of Curses alternative for windows
Presumably you are using Python 2.x, so your characters are bytes and therefore encoding-dependent. The meaning of a particular numeric value is determined by the encoding in use. Most likely you are using UTF-8 on Linux and something non-UTF-8 in your Windows program, so you cannot compare the values. In curses you should use curses.ACS_HLINE and curses.ACS_VLINE.
You cannot mix print and curses functions; it will mess up the display. Use curses.addch or its variants instead.
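Putting those answers together, a minimal curses version of the original drawing might look like this (a sketch; the 79-column and 25-row geometry is carried over from the C code):
import curses

def main(stdscr):
    stdscr.clear()
    stdscr.hline(12, 0, curses.ACS_HLINE, 79)   # horizontal line on row 12
    stdscr.vline(1, 39, curses.ACS_VLINE, 24)   # vertical line in column 39
    stdscr.refresh()
    stdscr.getch()                              # wait for a key before exiting

curses.wrapper(main)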

How do I use extended characters in Python's curses library?

I've been reading tutorials about Curses programming in Python, and many refer to an ability to use extended characters, such as line-drawing symbols. They're characters > 255, and the curses library knows how to display them in the current terminal font.
Some of the tutorials say you use it like this:
c = ACS_ULCORNER
...and some say you use it like this:
c = curses.ACS_ULCORNER
(That's supposed to be the upper-left corner of a box, like an L flipped vertically)
Anyway, regardless of which method I use, the name is not defined and the program thus fails. I tried "import curses" and "from curses import *", and neither works.
Curses' window() function makes use of these characters, so I even tried poking around on my box for the source to see how it does it, but I can't find it anywhere.
You have to set your locale (LC_ALL), then encode your output as UTF-8, as follows:
import curses
import locale
locale.setlocale(locale.LC_ALL, '') # set your locale
scr = curses.initscr()
scr.clear()
scr.addstr(0, 0, u'\u3042'.encode('utf-8'))
scr.refresh()
# here implement simple code to wait for user input to quit
scr.endwin()
output:
あ
From curses/__init__.py:
Some constants, most notably the ACS_* ones, are only added to the C _curses module's dictionary after initscr() is called. (Some versions of SGI's curses don't define values for those constants until initscr() has been called.) This wrapper function calls the underlying C initscr(), and then copies the constants from the _curses module to the curses package's dictionary. Don't do 'from curses import *' if you'll be needing the ACS_* constants.
In other words:
>>> import curses
>>> curses.ACS_ULCORNER
exception
>>> curses.initscr()
>>> curses.ACS_ULCORNER
4194412
I believe the below is appropriately related, to be posted under this question. Here I'll be using utfinfo.pl (see also on Super User).
First of all, for the standard ASCII character set, the Unicode code point and the byte encoding are the same:
$ echo 'a' | perl utfinfo.pl
Char: 'a' u: 97 [0x0061] b: 97 [0x61] n: LATIN SMALL LETTER A [Basic Latin]
So we can do in Python's curses:
window.addch('a')
window.border('a')
... and it works as intended
However, if a character is above basic ASCII, then there are differences, which addch docs don't necessarily make explicit. First, I can do:
window.addch(curses.ACS_PI)
window.border(curses.ACS_PI)
... in which case, in my gnome-terminal, the Unicode character 'π' is rendered. However, if you inspect ACS_PI, you'll see it's an integer with a value of 4194427 (0x40007b); so the following will also render the same character (or rather, glyph?) 'π':
window.addch(0x40007b)
window.border(0x40007b)
To see what's going on, I grepped through the ncurses source, and found the following:
#define ACS_PI NCURSES_ACS('{') /* Pi */
#define NCURSES_ACS(c) (acs_map[NCURSES_CAST(unsigned char,c)])
#define NCURSES_CAST(type,value) static_cast<type>(value)
#lib_acs.c: NCURSES_EXPORT_VAR(chtype *) _nc_acs_map(void): MyBuffer = typeCalloc(chtype, ACS_LEN);
#define typeCalloc(type,elts) (type *)calloc((elts),sizeof(type))
#./widechar/lib_wacs.c: { '{', { '*', 0x03c0 }}, /* greek pi */
Note here:
$ echo '{π' | perl utfinfo.pl
Got 2 uchars
Char: '{' u: 123 [0x007B] b: 123 [0x7B] n: LEFT CURLY BRACKET [Basic Latin]
Char: 'π' u: 960 [0x03C0] b: 207,128 [0xCF,0x80] n: GREEK SMALL LETTER PI [Greek and Coptic]
... neither of which relates to the value of 4194427 (0x40007b) for ACS_PI.
Thus, when addch and/or border see a character above ASCII (basically an unsigned int, as opposed to unsigned char), they (at least in this instance) use that number not as Unicode code point, or as UTF-8 encoded bytes representation - but instead, they use it as a look-up index for acs_map-ping function (which ultimately, however, would return the Unicode code point, even if it emulates VT-100). That is why the following specification:
window.addch('π')
window.border('π')
will fail in Python 2.7 with argument 1 or 3 must be a ch or an int; in Python 3.2 it simply renders a space instead of the character. When we specify 'π', we've actually specified the UTF-8 encoding [0xCF,0x80]; but even if we specify the Unicode code point:
window.addch(0x03C0)
window.border(0x03C0)
... it simply renders nothing (space) in both Python 2.7 and 3.2.
That being said - the function addstr does accept UTF-8 encoded strings, and works fine:
window.addstr('π')
... but for borders, since border() apparently handles characters the same way addch() does, we're apparently out of luck for anything not explicitly specified as an ACS constant (and there aren't that many of them, either).
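For a full border, then, the eight ACS constants are the portable route. A minimal sketch:
import curses

stdscr = curses.initscr()
# border(ls, rs, ts, bs, tl, tr, bl, br): all eight pieces given explicitly
stdscr.border(curses.ACS_VLINE, curses.ACS_VLINE,
              curses.ACS_HLINE, curses.ACS_HLINE,
              curses.ACS_ULCORNER, curses.ACS_URCORNER,
              curses.ACS_LLCORNER, curses.ACS_LRCORNER)
stdscr.refresh()
stdscr.getch()
curses.endwin()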
Hope this helps someone,
Cheers!
