Display width of unicode strings in Python [duplicate] - python

This question already has answers here:
Normalizing Unicode
(2 answers)
Closed 8 years ago.
How can I determine the display width of a Unicode string in Python 3.x, and is there a way to use that information to align those strings with str.format()?
Motivating example: Printing a table of strings to the console. Some of the strings contain non-ASCII characters.
>>> for title in d.keys():
...     print("{:<20} | {}".format(title, d[title]))
zootehni-            | zooteh.
zootekni-            | zootek.
zoothèque           | zooth.
zooveterinar-        | zoovet.
zoovetinstitut-      | zoovetinst.
母                    | 母母
>>> s = 'è'
>>> len(s)
2
>>> [ord(c) for c in s]
[101, 768]
>>> unicodedata.name(s[1])
'COMBINING GRAVE ACCENT'
>>> s2 = '母'
>>> len(s2)
1
As can be seen, str.format() simply takes the number of code-points in the string (len(s)) as its width, leading to skewed columns in the output. Searching through the unicodedata module, I have not found anything suggesting a solution.
Unicode normalization can fix the problem for è, but not for Asian characters, which often have larger display width. Similarly, zero-width unicode characters exist (e.g. zero-width space for allowing line breaks within words). You can't work around these issues with normalization, so please do not suggest "normalize your strings".
Edit: Added info about normalization.
Edit 2: My original dataset also has some European combining characters that don't result in a single code-point even after normalization:
zwemwater | zwemw.
zwia̢z- | zw.
>>> s3 = 'a\u0322' # The 'a + combining retroflex hook below' from zwiaz
>>> len(unicodedata.normalize('NFC', s3))
2

You have several options:
Some consoles support escape sequences for pixel-exact positioning of the cursor. Might cause some overprinting, though.
Historical note: This approach was used in the Amiga terminal to display images in a console window by printing a line of text and then advancing the cursor down by one pixel. The leftover pixels of the text line slowly built an image.
Create a table in your code which contains the real (pixel) widths of all Unicode characters in the font that is used in the console / terminal window. Use a UI framework and a small Python script to generate this table.
Then add code which calculates the real width of the text using this table. The result might not be a multiple of the character width in the console, though. Together with pixel-exact cursor movement, this might solve your issue.
Note: You'll have to add special handling for ligatures (fi, fl) and composites. Alternatively, you can load a UI framework without opening a window and use the graphics primitives to calculate the string widths.
Use the tab character (\t) to indent. But that will only help if your shell actually uses the real text width to place the cursor. Many terminals will simply count characters.
Create an HTML file with a table and look at it in a browser.
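If the output goes to a terminal with a monospace font, a cheaper alternative to measuring pixels is to estimate the number of terminal cells each character occupies. The sketch below is only an approximation under stated assumptions: it assumes combining marks and format characters (such as the zero-width space) take zero cells and East Asian Wide/Fullwidth characters take two. Many terminals behave that way, but none is obliged to. The names display_width and pad_to_width are made up for this example.

```python
import unicodedata

def display_width(text):
    """Estimate how many terminal cells `text` occupies (assumption-based)."""
    width = 0
    for ch in text:
        # Assumption: combining marks (Mn, Me) and format characters (Cf)
        # such as U+200B ZERO WIDTH SPACE occupy no cell of their own.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # Assumption: East Asian Wide (W) and Fullwidth (F) take two cells.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

def pad_to_width(text, columns):
    """Left-align `text` by padding with spaces up to `columns` cells."""
    return text + " " * max(0, columns - display_width(text))

# Hypothetical sample data resembling the table in the question.
d = {"zootehni-": "zooteh.", "zoothe\u0300que": "zooth.", "\u6bcd": "\u6bcd\u6bcd"}
for title, abbreviation in d.items():
    print("{} | {}".format(pad_to_width(title, 20), abbreviation))
```

Padding by hand like this sidesteps str.format(), whose width field always counts code points.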

Related

python in Visual Studio Code - how to print funky stuff

I have been testing printing colors and characters in VS Code (version 1.69) using python 3.+. To print colored text in VS code you would use:
print("\033[31mThis is red font.\033[0m")
print("\033[32mThis is green font.\033[0m")
print("\033[33mThis is yellow font.\033[0m")
print("\033[34mThis is blue font.\033[0m")
print("\033[37mThis is the default font. \033[0m")
Special characters would be like the following:
print("\1\2\3\4\05\06\07\016\017\013\014\020")
print("\21\22\23\24\25\26\27\36\37\31\32\34\35")
Part 1 of my question: How would you print special characters from a loop? What I tried is:
for i in range(1, 99):
    t = "\\" + str(i)
    print(t)
Part 2: Is there a way to print dark text with a colored highlighted background?
The first example uses ANSI escape sequences. The second uses a convention common to many languages, including Python, for putting non-printable characters in a string by escaping their character value. What you may not have realised is that those escapes are octal values, not decimal ones.
Printing them is no different from printing any character though - I think you may be confusing printing strings representing values and the actual values of variables, a very common mistake/confusion for beginning programmers. If you want to be able to print ('\21') without writing out the string, you could just print(chr(17)), because 17 is the decimal equivalent of octal 21.
Have a look at the documentation for string literals for more detail.
The loop you're trying to create would be something like:
for i in range(1, 99):
    print(chr(i))
But you have to keep in mind that if i gets to 21, it's not printing '\21', but '\25' since 25 is the octal representation of the decimal value 21.
Note: you're asking specifically about VSCode, but that's a different question altogether. Whether the console in VSCode renders ANSI escape sequences depends on the type of terminal; it doesn't have much to do with what you do in your code. However, if you want ANSI escape sequences to render in text files, there are extensions for that.
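To make the octal/decimal point concrete, here is a small sketch. It also touches on part 2 of the question, assuming the terminal honours the standard SGR colour codes (30-37 for foreground, 40-47 for background). The helper names char_from_octal_digits and highlighted are invented for this example.

```python
# "\21" in a string literal is an octal escape: octal 21 is decimal 17.
assert "\21" == chr(0o21) == chr(17)

def char_from_octal_digits(digits):
    """Return the character whose code is written in octal, e.g. "21" -> chr(17)."""
    return chr(int(digits, 8))

def highlighted(text, fg=30, bg=43):
    """Wrap text in SGR codes; the defaults are black text (30) on yellow (43)."""
    return "\033[{};{}m{}\033[0m".format(fg, bg, text)

# Only works as intended on a terminal that interprets ANSI escape sequences.
print(highlighted("dark text on a yellow background"))
```

The foreground and background codes are combined in one sequence, separated by a semicolon, and \033[0m resets both afterwards.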

Unusual font when extracting text from PDF

I have been trying to extract text from PDF files and most of the files seem to work fine. However, one particular document has text in this unusual font: ｉｎ ｓｏｌｉｄ
I have tried extraction using PHP and then Python and both were unable to fix this font. I tried copying the text to see if I could fix it in text editing tools but couldn't do much. Please note that the original PDF document looks fine, but when the text is copied and pasted into a text editing tool, gaps start to appear between the characters. I am completely clueless about what to do. Please suggest a solution to fix this in PHP/Python (preferably PHP).
Pre-Unicode, some character encodings allowed you to compose Japanese/Korean/Chinese characters either as two half-width characters or one full-width character. In that case, Latin characters could be full width so they could be mixed evenly with the other characters. You have full-width Latin characters on your hands, and that's why they space out oddly.
You can normalize the string with NFKC (compatibility composition) to get regular Latin characters; NFKD gives the same result for these characters. Be aware that this also changes any half-width or full-width Japanese/Korean/Chinese characters to their compatibility equivalents (for example, half-width katakana become regular katakana).
>>> import unicodedata
>>> t="ｉｎ ｓｏｌｉｄ"
>>> unicodedata.normalize("NFKC", t)
'in solid'
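Because compatibility normalization also rewrites other characters (ligatures, superscripts and so on), a narrower alternative is to translate only the full-width block. This is a sketch, not a drop-in fix: it relies on the fact that the fullwidth forms U+FF01 to U+FF5E mirror ASCII U+0021 to U+007E at a fixed offset of 0xFEE0, and it additionally assumes you want the ideographic space U+3000 turned into an ordinary space. The name fullwidth_to_ascii is made up here.

```python
# Fullwidth forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E at offset 0xFEE0.
FULLWIDTH_TO_ASCII = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
# Assumption: the ideographic space should become an ordinary space as well.
FULLWIDTH_TO_ASCII[0x3000] = 0x20

def fullwidth_to_ascii(text):
    """Replace full-width Latin letters, digits and punctuation with ASCII."""
    return text.translate(FULLWIDTH_TO_ASCII)

sample = "\uff49\uff4e\u3000\uff53\uff4f\uff4c\uff49\uff44"  # full-width "in solid"
print(fullwidth_to_ascii(sample))
```

Everything outside that block, including accented letters and CJK text, passes through unchanged.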

Python, Convert white spaces to invisible ASCII character

A quick question. I have no idea how to google for an answer.
I have a python program for taking input from users and generating string output.
I then use this string variables to populate text boxes on other software (Illustrator).
The string is made out of: number + '%' + text, e.g.
'50% Cool', '25% Alright', '25% Decent'.
These three elements are inserted into one text box (next to one another), and, as usual with text boxes, if the whole text does not fit on one line, it wraps to the next line at a white space ' '. Like so:
50% Cool 25% Alright 25%
Decent
I need to keep this feature in (Where text gets moved down to a lower line if it does not fit) but I need it to move the whole element and not split it.
Like So:
50% Cool 25% Alright
25% Decent
The only way I can think of to stop this from happening is to use some sort of invisible ASCII code which connects the parts of each element together (while still retaining human-visible white spaces).
Does anyone know of such an ASCII connector that could be used?
So, understand first of all that what you're asking about is encoding specific. Plain 7-bit ASCII has no non-breaking space. In Latin-1 (and Windows-1252) it is character 160; in some DOS code pages, such as CP437, it is character 255. (See this discussion for details.) Eg:
non_breaking_space = chr(160)  # chr() turns a code into a character; ord() does the reverse
HOWEVER, that's for encoded ASCII binary strings. Assuming you're using Python 3 (which you should consider if you're not), your strings are all Unicode strings, not encoded binary strings.
And this also begs the question of how you plan on getting these strings into Illustrator. Does the user copy and paste them? Are they encoded into a file? That will affect how you want to transmit a non-breaking space.
Assuming you're using Python 3 and not worrying about encoding, this is what you want:
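'25%\u00a0Alright'
\u00a0 inserts a Unicode non-breaking space (\u0020 would be an ordinary, breakable space). Put it inside each element and keep ordinary spaces between the elements, so that lines can only wrap between elements.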
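As a rough sketch of how the whole string could be assembled: the non-breaking space U+00A0 goes inside each element and an ordinary space goes between elements. Whether Illustrator honours U+00A0 when wrapping is an assumption to verify on your side, and the function name join_unbreakable is invented for this example.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE

def join_unbreakable(items):
    """Join items with normal spaces, making each item internally unbreakable."""
    return " ".join(item.replace(" ", NBSP) for item in items)

elements = ["50% Cool", "25% Alright", "25% Decent"]
text = join_unbreakable(elements)
print(repr(text))
```

A text box that respects non-breaking spaces can then only wrap at the ordinary spaces, that is, between whole elements.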

How to get the length of a unicode string? [duplicate]

Given a character like "✮" (\xe2\x9c\xae), for example (it could be others like "Σ", "д" or "Λ"), I want to find the "actual" length that character takes when printed onscreen.
for example
len("✮")
len("\xe2\x9c\xae")
both return 3, but it should be 1
You may try like this:
unicodedata.normalize('NFC', u'✮')
len(u"✮")
UTF-8 is a Unicode encoding which uses more than one byte for non-ASCII characters, so the length of a byte string is not the number of characters. Check unicodedata.normalize()
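The difference between bytes and code points can be seen directly. This is a minimal Python 3 illustration; in Python 2 the same distinction exists between str (bytes) and unicode objects.

```python
star = "\u272e"                 # the ✮ character as a Unicode string
encoded = star.encode("utf-8")  # the same character as UTF-8 bytes

print(len(star))     # number of code points
print(len(encoded))  # number of bytes in the UTF-8 encoding
print(encoded)

# Decoding the bytes gives back a one-character string.
decoded = b"\xe2\x9c\xae".decode("utf-8")
print(len(decoded))
```

Note that the number of code points is still not the display width; it only removes the byte-counting part of the problem.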
My answer to a similar question:
You are looking for the rendering width from the current output context. For graphical UIs, there is usually a method to directly query this information; for text environments, all you can do is guess what a conformant rendering engine would probably do, and hope that the actual engine matches your expectations.

Python counting zero-length control characters in string formatting width field?

In Python 3.4.3, I was trying to width-align some fields using str.format(), and it appears to count zero-length control characters against the width total. Sample code:
ANSI_RED = "\033[31m"
ANSI_DEFAULT="\033[39m\033[49m"
string1 = "12"
string2 = ANSI_RED+"12"+ANSI_DEFAULT
print("foo{:4s}bar".format(string1))
print("foo{:4s}bar".format(string2))
This will output:
foo12  bar
foo12bar
(with the second output having '12' in red, but I can't reproduce that in SO)
In the second case, I've lost my field width, I assume because Python saw that the total number of chars in the string was larger than the width, despite most of those chars resulting in zero-length on an ANSI-conforming terminal.
What's a clean way of having ANSI colors and working field widths?
Unfortunately, you will have to strip the escape sequences to get a displayed field width.
The len() function returns the number of bytes in a Python 2 str type and the number of code points in a Python 3 str type. That length has never been guaranteed to match the display width (which is a more challenging problem):
>>> s = 'abc\bde'
>>> print(s)
abde
>>> len(s)
6
In general, you can't know the display width for certain unless you know something about how the display will interpret the codes (i.e. the width is different depending on whether the device supports ANSI escape sequences).
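One way to do the stripping is sketched below. It assumes the only control codes present are CSI sequences of the common form ESC [ parameters letter (which covers colour codes such as \033[31m) and that every remaining code point takes one column. The names visible_len and ljust_visible are made up for this example.

```python
import re

# Assumption: only simple CSI sequences such as "\033[31m" need to be handled.
CSI_PATTERN = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")

def visible_len(text):
    """Length of `text` with CSI escape sequences removed."""
    return len(CSI_PATTERN.sub("", text))

def ljust_visible(text, width):
    """Pad `text` with spaces so that its visible part is `width` columns."""
    return text + " " * max(0, width - visible_len(text))

ANSI_RED = "\033[31m"
ANSI_DEFAULT = "\033[39m\033[49m"
string2 = ANSI_RED + "12" + ANSI_DEFAULT

print("foo{}bar".format(ljust_visible(string2, 4)))
```

The padding is computed from the stripped text but appended to the original, so the colours are kept.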
I don't know if it will qualify as "clean", but something in the vein of the following is workable:
print("foo{0}{1:4s}{2}bar".format(ANSI_RED, string1, ANSI_DEFAULT))
Getting terminal control codes right is really difficult (as seen below, not all of them have a well-defined width), so your best bet is probably to use explicit column movement.
# string2 defined as above
def col(n): return "\033[{:d}G".format(n)
print("foo{:s}{:s}bar".format(string2,col(8)))
Output:
foo12  bar
