I've been reading tutorials about Curses programming in Python, and many refer to an ability to use extended characters, such as line-drawing symbols. They're characters > 255, and the curses library knows how to display them in the current terminal font.
Some of the tutorials say you use it like this:
c = ACS_ULCORNER
...and some say you use it like this:
c = curses.ACS_ULCORNER
(That's supposed to be the upper-left corner of a box, like an L flipped vertically)
Anyway, regardless of which method I use, the name is not defined and the program thus fails. I tried "import curses" and "from curses import *", and neither works.
Curses' window() function makes use of these characters, so I even tried poking around on my box for the source to see how it does it, but I can't find it anywhere.
You have to set your locale to the user's default (LC_ALL), then encode your output as UTF-8, as follows:
import curses
import locale
locale.setlocale(locale.LC_ALL, '') # set your locale
scr = curses.initscr()
scr.clear()
scr.addstr(0, 0, u'\u3042'.encode('utf-8'))
scr.refresh()
scr.getch()  # wait for any keypress before quitting
curses.endwin()
output:
あ
From curses/__init__.py:

Some constants, most notably the ACS_* ones, are only added to the C _curses module's dictionary after initscr() is called. (Some versions of SGI's curses don't define values for those constants until initscr() has been called.) This wrapper function calls the underlying C initscr(), and then copies the constants from the _curses module to the curses package's dictionary. Don't do 'from curses import *' if you'll be needing the ACS_* constants.
In other words:
>>> import curses
>>> curses.ACS_ULCORNER
Traceback (most recent call last):
  ...
AttributeError: 'module' object has no attribute 'ACS_ULCORNER'
>>> curses.initscr()
>>> curses.ACS_ULCORNER
4194412
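A minimal sketch of the safe pattern follows; curses.wrapper() calls initscr() before invoking your function, so the ACS_* constants exist by the time you need them (the border() argument order shown is the standard left, right, top, bottom, then four corners):

```python
import curses

def main(stdscr):
    # By the time wrapper() calls main(), initscr() has already run,
    # so the ACS_* constants exist on the curses module.
    stdscr.border(curses.ACS_VLINE, curses.ACS_VLINE,
                  curses.ACS_HLINE, curses.ACS_HLINE,
                  curses.ACS_ULCORNER, curses.ACS_URCORNER,
                  curses.ACS_LLCORNER, curses.ACS_LRCORNER)
    stdscr.getch()  # wait for a keypress so the box stays visible

# Run it with:
# curses.wrapper(main)   # wrapper() pairs initscr() with endwin() for you
```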
I believe the following is closely enough related to be posted under this question. Here I'll be using utfinfo.pl (see also on Super User).
First of all, for the standard ASCII character set, the Unicode code point and the byte encoding are the same:
$ echo 'a' | perl utfinfo.pl
Char: 'a' u: 97 [0x0061] b: 97 [0x61] n: LATIN SMALL LETTER A [Basic Latin]
So we can do in Python's curses:
window.addch('a')
window.border('a')
... and it works as intended
However, if a character is above basic ASCII, then there are differences, which the addch docs don't necessarily make explicit. First, I can do:
window.addch(curses.ACS_PI)
window.border(curses.ACS_PI)
... in which case, in my gnome-terminal, the Unicode character 'π' is rendered. However, if you inspect ACS_PI, you'll see it's an integer, with a value of 4194427 (0x40007b); so the following will also render the same character (or rather, glyph?) 'π':
window.addch(0x40007b)
window.border(0x40007b)
To see what's going on, I grepped through the ncurses source, and found the following:
#define ACS_PI NCURSES_ACS('{') /* Pi */
#define NCURSES_ACS(c) (acs_map[NCURSES_CAST(unsigned char,c)])
#define NCURSES_CAST(type,value) static_cast<type>(value)
lib_acs.c: NCURSES_EXPORT_VAR(chtype *) _nc_acs_map(void): MyBuffer = typeCalloc(chtype, ACS_LEN);
#define typeCalloc(type,elts) (type *)calloc((elts),sizeof(type))
./widechar/lib_wacs.c: { '{', { '*', 0x03c0 }}, /* greek pi */
Note here:
$ echo '{π' | perl utfinfo.pl
Got 2 uchars
Char: '{' u: 123 [0x007B] b: 123 [0x7B] n: LEFT CURLY BRACKET [Basic Latin]
Char: 'π' u: 960 [0x03C0] b: 207,128 [0xCF,0x80] n: GREEK SMALL LETTER PI [Greek and Coptic]
... neither of which relates to the value of 4194427 (0x40007b) for ACS_PI.
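In fact, the value decomposes neatly: ncurses sets the A_ALTCHARSET attribute bit on the index character. A quick sketch, where the numeric attribute value 0x400000 is an assumption about the default ncurses ABI (at runtime, prefer curses.A_ALTCHARSET):

```python
# A_ALTCHARSET marks a chtype as an index into acs_map rather than a
# literal character; 0x400000 is its value in the default ncurses ABI.
A_ALTCHARSET = 0x400000

# ACS_PI is just '{' with the "alternate character set" bit set:
acs_pi = A_ALTCHARSET | ord('{')
assert acs_pi == 0x40007b == 4194427
```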
Thus, when addch and/or border see a character above ASCII (basically an unsigned int, as opposed to an unsigned char), they (at least in this instance) use that number not as a Unicode code point, nor as a UTF-8 encoded byte representation - but instead as a look-up index into acs_map (which, ultimately, would return the Unicode code point, even if it emulates VT-100). That is why the following specification:
window.addch('π')
window.border('π')
will fail in Python 2.7 with argument 1 or 3 must be a ch or an int; and in Python 3.2 would render simply a space instead of the character. When we specify 'π', we've actually specified the UTF-8 encoding [0xCF,0x80] - but even if we specify the Unicode code point:
window.addch(0x03C0)
window.border(0x03C0)
... it simply renders nothing (space) in both Python 2.7 and 3.2.
That being said - the function addstr does accept UTF-8 encoded strings, and works fine:
window.addstr('π')
... but for borders - since border() apparently handles characters in the same way addch() does - we're apparently out of luck, for anything not explicitly specified as an ACS constant (and there's not that many of them, either).
Hope this helps someone,
Cheers!
I'm trying to port my Vim 8.0 configuration (~/.vimrc) to Python. That is, I'm setting Vim options as fields on vim.options mapping:
import vim
# set wildmenu
vim.options['wildmenu'] = True
# set wildcharm=<C-Z>
vim.options['wildcharm'] = ord('^Z') # [Literal ^Z (ASCII 26), CTRL-V CTRL-Z]
# set wildchar=<F10>
vim.options['wildchar'] = -15211 # extracted from Vim
The wildchar and wildcharm Vim options are of type "number". As far as I understand, they expect a kind of a keycode (at least in simple cases it is the ASCII code of the character in question).
In Vimscript, when you say something like set wildchar=<F10>, Vim translates the Vim-specific textual representation into a numeric keycode.
In Python, this is not the case (vim.options['wildchar'] = '<F10>' gives a TypeError).
For simple cases, it is possible to use ord() on a string containing the literally typed control character (see above with Ctrl-Z). However, a key like F10 produces multiple characters, so I can't use ord() on it.
In the end, I want to be able to do something like this:
vim.options['wildchar'] = magic('<F10>')
Does this magic() function exist?
Edit: I'm not asking how to invoke Vimscript code from Python (i. e. vim.command(...)). I understand that the encompassing problem can be trivially solved this way, but I'm asking a different question here.
:python vim.command("set wildchar=<F10>")
See the vim.command documentation for more explanation.
I use Python IDLE a lot in my day-to-day job, mostly for short scripts and as a powerful and convenient calculator.
I usually have to work with different numeric bases (mostly decimal, hexadecimal, binary and less frequently octal and other bases.)
I know that using int(), hex(), bin(), oct() is a convenient way to move from one base to another, and prefixing integer literals with the right prefix is another way to express a number.
I find it quite inconvenient to have to wrap a calculation in a function just to see the result in the right base (and the resulting output of hex() and similar functions is a string), so what I'm trying to achieve is either a function (or maybe a statement?) that sets the internal IDLE number representation to a known base (2, 8, 10, 16).
Example :
>>> repr_hex() # from now on, all number are considered hexadecimal, in input and in output
>>> 10 # 16 in dec
>>> 0x10 # now output is also in hexadecimal
>>> 1e + 2
>>> 0x20
# override should be possible with integer literal prefixes
# 0x: hex ; 0b: bin ; 0n: dec ; 0o: oct
>>> 0b111 + 10 + 0n10 # dec : 7 + 16 + 10
>>> 0x21 # 33 dec
# still possible to override output representation temporarily with a conversion function
>>> conv(_, 10) # conv(x, output_base, current_base=internal_base)
>>> 0n33
>>> conv(_, 2) # use prefix of previous output to set current_base to 10
>>> 0b100001
>>> conv(10, 8, 16) # convert 10 to base 8 (10 is in base 16: 0x10)
>>> 0o20
>>> repr_dec() # switch to base 10, in input and in output
>>> _
>>> 0n16
>>> 10 + 10
>>> 0n20
Implementing those features doesn't seem to be difficult, what I don't know is:
Is it possible to change number representation in IDLE?
Is it possible to do this without having to change IDLE (source code) itself? I looked at IDLE extensions, but I don't know where to start to have access to IDLE internals from there.
Thank you.
IDLE does not have a number representation. It sends the code you enter to a Python interpreter and displays the string sent back in response. In this sense, it is irrelevant that IDLE is written in Python. The same is true of any IDE or REPL for Python code.
That said, the CPython sys module has a displayhook function. For 3.5:
>>> help(sys.displayhook)
Help on built-in function displayhook in module sys:
displayhook(...)
displayhook(object) -> None
Print an object to sys.stdout and also save it in builtins._
That actually should be __builtins__._, as in the example below. Note that the input is any Python object. For IDLE, the default sys.displayhook is a function defined in idlelib/rpc.py. Here is an example relevant to your question.
>>> def new_hook(ob):
...     if type(ob) is int:
...         ob = hex(ob)
...     __builtins__._ = ob
...     print(ob)
...
>>> sys.displayhook = new_hook
>>> 33
0x21
>>> 0x21
0x21
This gives you the more important half of what you asked for. Before actually using anything in IDLE, I would look at the default version to make sure I did not miss anything. One could write an extension to add menu entries that would switch displayhooks.
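For reference, here is a standalone sketch of such a hook, runnable outside IDLE as well (hex_displayhook is my name for it, not a stdlib one; unlike the session above, it keeps the original object in _ and only formats the printed text):

```python
import builtins
import sys

def hex_displayhook(value):
    """Display ints in hex; otherwise behave like the default hook."""
    if value is None:
        return                  # the REPL prints nothing for None
    builtins._ = value          # keep the real object in _, not a string
    if isinstance(value, int) and not isinstance(value, bool):
        print(hex(value))
    else:
        print(repr(value))

# Install it with:  sys.displayhook = hex_displayhook
# Restore it with:  sys.displayhook = sys.__displayhook__
```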
Python intentionally does not have an input preprocessor function. GvR wants the contents of a .py file to always be Python code as defined in some version of the reference manual.
I have thought about the possibility of adding an inputhook to IDLE, but I would not allow one to be active when running a .py file from the editor. If there were one added for the Shell, I would change the prompt from '>>>' to something else, such as 'hex>' or 'bin>'.
EDIT:
One could also write an extension to rewrite input code when explicitly requested either with a menu selection or a hot key or key binding. Or one could edit the current idlelib/ScriptBinding.py to make rewriting automatic. The hook I have thought about would make this easier, but not expand what can be done now.
I have C code which draws a vertical and a horizontal line in the center of the screen, as below:
#include <stdio.h>
#include <conio.h>  /* Turbo C: clrscr(), gotoxy() */

#define HLINE for(i=0;i<79;i++)\
                  printf("%c",196);
#define VLINE(X,Y) {\
                  gotoxy(X,Y);\
                  printf("%c",179);\
              }

int main()
{
    int i,y;
    clrscr();
    gotoxy(1,12);
    HLINE
    for(y=1;y<25;y++)
        VLINE(39,y)
    return 0;
}
I am trying to convert it literally to Python (version 2.7.6):
import curses

def HLINE():
    for i in range(0, 79):
        print "%c" % 45

def VLINE(X, Y):
    curses.setsyx(Y, X)
    print "%c" % 124

curses.setsyx(12, 1)
HLINE()
for y in range(1, 25):
    VLINE(39, y)
My questions:
1. Do we have to change the position of x and y in the setsyx function, i.e., does gotoxy(1,12) become setsyx(12,1)?
2. Is the curses module only available for Unix, not for Windows? If so, what about Windows (Python 2.7.6)?
3. Why are the character values 179 and 196 rendered as � in Python, when in C they are | and - respectively?
4. Is the above Python code literally right, or does it need some improvement?
Yes, you will have to swap the argument positions: setsyx(y, x) versus gotoxy(x, y).
There are Windows libraries made available. I find most useful binaries here: link
This most likely has to do with unicode formatting. What you could try to do is add the following line to the top of your python file (after the #!/usr/bin/python line) as this forces python to work with utf-8 encoding in String objects:
# -*- coding: utf-8 -*-
Your Python code to me looks acceptable enough, I wouldn't worry about it.
Yes.
Duplicate of Curses alternative for windows
Presumably you are using Python 2.x, thus your characters are bytes and therefore encoding-dependent. The meaning of a particular numeric value is determined by the encoding used. Most likely you are using utf8 on Linux and something non-utf8 in your Windows program, so you cannot compare the values. In curses you should use curses.ACS_HLINE and curses.ACS_VLINE.
You cannot mix print and curses functions, it will mess up the display. Use curses.addch or variants instead.
Conclusion: It's impossible to override or disable Python's built-in escape sequence processing such that you can skip using the raw prefix specifier. I dug into Python's internals to figure this out. So if anyone tries designing objects that work on complex strings (like regexes) as part of some kind of framework, make sure to specify in the docstrings that string arguments to the object's __init__() MUST include the r prefix!
Original question: I am finding it a bit difficult to force Python to not "change" anything about a user-inputted string, which may contain among other things, regex or escaped hexadecimal sequences. I've already tried various combinations of raw strings, .encode('string-escape') (and its decode counterpart), but I can't find the right approach.
Given an escaped, hexadecimal representation of the Documentation IPv6 address 2001:0db8:85a3:0000:0000:8a2e:0370:7334, using .encode(), this small script (called x.py):
#!/usr/bin/env python

class foo(object):
    __slots__ = ("_bar",)

    def __init__(self, input):
        if input is not None:
            self._bar = input.encode('string-escape')
        else:
            self._bar = "qux?"

    def _get_bar(self): return self._bar
    bar = property(_get_bar)

x = foo("\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
print x.bar
Will yield the following output when executed:
$ ./x.py
\x01\r\xb8\x85\xa3\x00\x00\x00\x00\x8a.\x03ps4
Note the \x20 got converted to an ASCII space character, along with a few others. This is basically correct due to Python processing the escaped hex sequences and converting them to their printable ASCII values.
This can be solved if the initializer to foo() was treated as a raw string (and the .encode() call removed), like this:
x = foo(r"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
However, my end goal is to create a kind of framework that can be used and I want to hide these kinds of "implementation details" from the end user. If they called foo() with the above IPv6 address in escaped hexadecimal form (without the raw specifier) and immediately print it back out, they should get back exactly what they put in w/o knowing or using the raw specifier. So I need to find a way to have foo's __init__() do whatever processing is necessary to enable that.
Edit: Per this SO question, it seems it's a defect of Python, in that it always performs some kind of escape sequence processing. There does not appear to be any kind of facility to completely turn off escape sequence processing, even temporarily. Sucks. I guess I am going to have to research subclassing str to create something like rawstr that intelligently determines what escape sequences Python processed in a string, and convert them back to their original format. This is not going to be fun...
Edit2: Another example, given the sample regex below:
"^.{0}\xcb\x00\x71[\x00-\xff]"
If I assign this to a var or pass it to a function without using the raw specifier, the \x71 gets converted to the letter q. Even if I add .encode('string-escape') or .replace('\\', '\\\\'), the escape sequences are still processed, thus resulting in this output:
"^.{0}\xcb\x00q[\x00-\xff]"
How can I stop this, again, without using the raw specifier? Is there some way to "turn off" the escape sequence processing or "revert" it after the fact thus that the q turns back into \x71? Is there a way to process the string and escape the backslashes before the escape sequence processing happens?
I think you have an understandable confusion about the difference between Python string literals (the source code representation), Python string objects in memory, and how those objects can be printed (in what format they can be represented in output).
If you read some bytes from a file into a bytestring you can write them back as is.
r"" exists only in source code; there is no such thing at runtime, i.e., r"\x" and "\\x" are equal, and they may even be the exact same string object in memory.
To see that input is not corrupted, you could print each byte as an integer:
print " ".join(str(ord(c)) for c in raw_input("input something"))
Or just echo as is (there could be a difference but it is unrelated to your "string-escape" issue):
print raw_input("input something")
Identity function:
def identity(obj):
return obj
If you do nothing to the string then your users will receive the exact same object back. You can provide examples in the docs of what you consider a concise, readable way to represent an input string as a Python literal. If you find it confusing to work with binary strings such as "\x20\x01", then you could accept an ASCII hex representation instead: "2001" (you can use binascii.hexlify/unhexlify to convert one to the other).
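A quick sketch of that hexlify/unhexlify round trip (bytes literals are used so it behaves the same on Python 2 and 3):

```python
import binascii

# Round-trip between raw bytes and an ASCII hex representation
data = b"\x20\x01\x0d\xb8"        # four raw bytes
hexstr = binascii.hexlify(data)   # ASCII hex, two digits per byte
assert hexstr == b"20010db8"
assert binascii.unhexlify(hexstr) == data
```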
The regex case is more complex because there are two languages:
Escape sequences are interpreted by Python according to its string literal syntax
Regex engine interprets the string object as a regex pattern that also has its own escape sequences
I think you will have to go the join route.
Here's an example:
>>> m = {chr(c): '\\x{0}'.format(hex(c)[2:].zfill(2)) for c in xrange(0,256)}
>>>
>>> x = "\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
>>> print ''.join(map(m.get, x))
\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34
I'm not entirely sure why you need that though. If your code needs to interact with other pieces of code, I'd suggest that you agree on a defined format, and stick to it.
I'd like to create an array of the Unicode code points which constitute white space in JavaScript (minus the Unicode-white-space code points, which I address separately). These characters are horizontal tab, vertical tab, form feed, space, non-breaking space, and BOM. I could do this with magic numbers:
whitespace = [0x9, 0xb, 0xc, 0x20, 0xa0, 0xfeff]
That's a little bit obscure; names would be better. The unicodedata.lookup method passed through ord helps some:
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160
But this doesn't work for 0x9, 0xb, or 0xc -- I think because they're control characters, and the "names" FORM FEED and such are just alias names. Is there any way to map these "names" to the characters, or their code points, in standard Python? Or am I out of luck?
Kerrek SB's comment is a good one: just put the names in a comment.
BTW, Python also supports a named unicode literal:
>>> u"\N{NO-BREAK SPACE}"
u'\xa0'
But it uses the same unicode name database, and the control characters are not in it.
You could roll your own "database" for the control characters by parsing a few lines of the UCD files in the Unicode public directory. In particular, see the UnicodeData-6.1.0d3 file (or see the parent directory for earlier versions).
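A sketch of that "roll your own" approach: UnicodeData.txt is a semicolon-delimited file whose field 10 holds the old Unicode 1.0 name, which is where names like FORM FEED live for control characters. The sample lines below imitate the file's format (check them against the real file before relying on exact name strings):

```python
# Build a name -> code point map for control characters from
# UnicodeData.txt-style lines; field 1 is the (useless) "<control>" name,
# field 10 is the old Unicode 1.0 name.
SAMPLE = """\
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
000B;<control>;Cc;0;S;;;;;N;LINE TABULATION;;;;
000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
"""

control_names = {}
for line in SAMPLE.splitlines():
    fields = line.split(";")
    if fields[1] == "<control>" and fields[10]:
        control_names[fields[10]] = int(fields[0], 16)

assert control_names["FORM FEED (FF)"] == 0x0C
```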
I don't think it can be done in standard Python. The unicodedata module uses the UnicodeData.txt v5.2.0 Unicode database. Notice that the control characters are all assigned the name <control> (the second field, semicolon-delimited).
The script Tools/unicode/makeunicodedata.py in the Python source distribution is used to generate the table used by the Python runtime. The makeunicodename function looks like this:
def makeunicodename(unicode, trace):
    FILE = "Modules/unicodename_db.h"
    print "--- Preparing", FILE, "..."
    # collect names
    names = [None] * len(unicode.chars)
    for char in unicode.chars:
        record = unicode.table[char]
        if record:
            name = record[1].strip()
            if name and name[0] != "<":
                names[char] = name + chr(0)
    ...
Notice that it skips over entries whose name begins with "<". Hence, there is no name that can be passed to unicodedata.lookup that will give you back one of those control characters.
Just hardcode the code points for horizontal tab, line feed, and carriage return, and leave a descriptive comment. As the Zen of Python goes, "practicality beats purity".
A few points:
(1) "BOM" is not a character. BOM is a byte sequence that appears at the start of a file to indicate the byte order of a file that is encoded in UTF-nn. BOM is u'\uFEFF'.encode('UTF-nn'). Reading a file with the appropriate codec will slurp up the BOM; you don't see it as a Unicode character. A BOM is not data. If you do see u'\uFEFF' in your data, treat it as a (deprecated) ZERO-WIDTH NO-BREAK SPACE.
(2) "minus the Unicode-white-space code points, which I address separately"?? Isn't NO-BREAK SPACE a "Unicode-white-space" code point?
(3) Your Python appears to be broken; mine does this:
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160
(4) You could use escape sequences for the first three.
>>> map(hex, map(ord, "\t\v\f"))
['0x9', '0xb', '0xc']
(5) You could use " " for the fourth one.
(6) Even if you could use names, the readers of your code would still be applying blind faith that e.g. "FORM FEED" is a whitespace character.
(7) What happened to \r and \n?
Assuming you're working with Unicode strings, the first five items in your list, plus all other Unicode space characters, will be matched by the \s option when using a regular expression. Using Python 3.1.2:
>>> import re
>>> s = '\u0009,\u000b,\u000c,\u0020,\u00a0,\ufeff'
>>> s
'\t,\x0b,\x0c, ,\xa0,\ufeff'
>>> re.findall(r'\s', s)
['\t', '\x0b', '\x0c', ' ', '\xa0']
And as for the byte-order mark, the one given can be referred to as codecs.BOM_BE or codecs.BOM_UTF16_BE (though in Python 3+, it's returned as a bytes object rather than str).
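Those constants are easy to sanity-check (byte values per the Unicode encoding forms):

```python
import codecs

# BOM_BE is an alias for the UTF-16 big-endian byte-order mark
assert codecs.BOM_BE == codecs.BOM_UTF16_BE == b"\xfe\xff"
# The little-endian counterpart is the same two bytes reversed
assert codecs.BOM_UTF16_LE == b"\xff\xfe"
# The UTF-8 "BOM" is just U+FEFF encoded as UTF-8: three bytes
assert codecs.BOM_UTF8 == "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
```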
The official Unicode recommendation for newlines may or may not be at odds with the way the Python codecs module handles newlines. Since u'\n' is often said to mean "new line", one might expect based on this recommendation for the Python string u'\n' to represent character U+2028 LINE SEPARATOR and to be encoded as such, rather than as the semantic-less control character U+000A. But I can only imagine the confusion that would result if the codecs module actually implemented that policy, and there are valid counter-arguments besides. Ditto for horizontal/vertical tab and form feed, which are probably not really characters but controls anyway. (I would certainly consider backspace to be a control, not a character.)
Your question seems to assume that treating U+000A as a control character (instead of a line separator) is wrong; but that is not at all certain. Perhaps it is more wrong for text processing applications everywhere to assume that a legacy printer-platen-scrolling control signal is really a true "line separator".
You can extend the lookup function to handle the characters that aren't included:
import unicodedata

def unicode_lookup(x):
    try:
        ch = unicodedata.lookup(x)
    except KeyError:
        control_chars = {'LINE FEED': unichr(0x0a),
                         'CARRIAGE RETURN': unichr(0x0d)}
        if x in control_chars:
            ch = control_chars[x]
        else:
            raise
    return ch
>>> unicode_lookup('SPACE')
u' '
>>> unicode_lookup('LINE FEED')
u'\n'
>>> unicode_lookup('FORM FEED')
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
unicode_lookup('FORM FEED')
File "<pyshell#13>", line 3, in unicode_lookup
ch = unicodedata.lookup(x)
KeyError: "undefined character name 'FORM FEED'"