New line and tab characters in python on mac - python

I am printing a string to the Python shell on Mac OS X 10.7.3.
The string contains newline characters, \n\r, and tabs, \t.
I'm not 100% sure which newline characters are used on each platform, but I've tried every combination (\n, \n\r, \r) and the newline characters are printed literally in the shell:
'hello\n\r\tworld'
I don't know what I'm doing wrong; any help is appreciated.
Thanks

What look to you like newlines and carriage returns are actually two characters each: a backslash plus a normal character.
In Python 2, fix this by using your_string.decode('string_escape'):
>>> s = 'hello\\n\\r\\tworld' # or s = r'hello\n\r\tworld'
>>> print s
hello\n\r\tworld
>>> print repr(s)
'hello\\n\\r\\tworld'
>>> print s.decode('string_escape')
hello
world
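The session above is Python 2; str.decode('string_escape') does not exist in Python 3. A minimal sketch of a Python 3 counterpart follows, assuming the string really holds literal backslash sequences. The helper name unescape is hypothetical and not part of the original answer.

```python
def unescape(text):
    """Turn literal backslash sequences (backslash + letter) into real characters."""
    # Encode to bytes first: characters outside Latin-1 become \uXXXX escapes,
    # which the unicode_escape codec then decodes back again.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


s = r"hello\n\r\tworld"   # backslash + letter pairs, not real control characters
print(repr(s))            # 'hello\\n\\r\\tworld'
print(repr(unescape(s)))  # 'hello\n\r\tworld'
```

Checking repr() first, as the answer does, is the quickest way to tell whether the string contains real control characters or two-character escape text.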

Related

How to write a newline without an escape sequence

I wonder how to write \n, not using \n. What is the 'raw' way to write a new line?
Instead of
print("Hello\nWorld")
Output
Hello
World
I want
print("HelloSOMEENCODINGWorld")
Output
Hello
World
Is there a way to use ASCII, Hex, ... within the string?
You can use multi-line strings.
print("""Hello
World""")
But \n is better
You can use ANSI escape sequences, which are interpreted by the terminal rather than by Python:
print('Line1 \033[1B\033[50000DLine2')
# \033[1B moves the cursor to the line below
# \033[50000D moves the cursor 50000 columns to the left; 50000 is just an arbitrarily large number
Refer to a reference on ANSI escape sequences for more information.
One of the options is print('Hello{}World'.format(chr(10))).
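To address the "ASCII, hex, ..." part of the question, here is a small sketch showing several spellings of the same line feed character. Note that the hex, octal, and Unicode forms are still escape sequences; only chr(10) avoids the backslash entirely. The dictionary name candidates is just for illustration.

```python
# Each of these spells the line feed character (code point 10) differently.
candidates = {
    "chr(10)": chr(10),
    "hex escape": "\x0a",
    "octal escape": "\012",
    "unicode escape": "\u000a",
}

for label, newline in candidates.items():
    # Every variant is the very same one-character string as "\n".
    assert newline == "\n"
    print(label, repr("Hello" + newline + "World"))
```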
One way of doing this might be through os.linesep
import os
print('This is a line with a line break\nin the middle')
print(f'This is a line with a line break {os.linesep}in the middle')
But note this caveat:
Note: when writing to files using the Python API, do not use the os.linesep. Just use \n; Python automatically translates that to the proper newline character for your platform.
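A minimal sketch of that translation, assuming the file is opened in text mode with default newline handling. The file name demo.txt is arbitrary.

```python
import os
import tempfile

# Write a single "\n" in text mode, then read the raw bytes back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "w") as handle:
        handle.write("first\nsecond")
    with open(path, "rb") as handle:
        raw = handle.read()

# On Windows the bytes contain \r\n; on macOS and Linux they contain \n.
print(repr(raw))
print(raw == ("first" + os.linesep + "second").encode("ascii"))
```

Writing os.linesep yourself in text mode on Windows would end up as \r\r\n, which is why plain \n is the safer choice there.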

Why does print("\5") output a new line in Python?

What is the difference between print("\n") and print("\5")?
I tried the following in a Python shell.
Why does print("\5") output a new line?
>>> print("\n")
>>> print("\5")
>>>
But when I tried:
print("\4")
print("\6")
it prints what looks like binary data.
Whenever you use print in Python, it appends a newline to the output. What you should pay attention to is how many newlines appear in the output.
"\5" is just a single character (the control character ENQ in ASCII; although it is technically non-printable, my terminal renders it as ♣). Printing it outputs whatever your terminal uses to render it, followed by a newline. print("") outputs one newline; print("\n"), by contrast, outputs two.
If your terminal can't/won't render \5 (it is a non-printable character after all), print("\5") will be the same as print("").
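The point can be checked without relying on how a terminal renders control characters. This is a small sketch for Python 3; the helper name printed is hypothetical and simply captures what print would write.

```python
import io

# "\5" is an octal escape: one character with code point 5 (ENQ).
enq = "\5"
print(len(enq), ord(enq), enq == "\x05")   # 1 5 True


def printed(text):
    """Return exactly what print(text) would write, as a string."""
    buffer = io.StringIO()
    print(text, file=buffer)
    return buffer.getvalue()


print(repr(printed("")))     # '\n'
print(repr(printed("\n")))   # '\n\n'
print(repr(printed("\5")))   # '\x05\n'
```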

Format string confusing

Hi, I'm new to Python and I've been trying to get formatted printing to work but, and this may just be me being new, it seems to be very badly implemented. The examples for 2.7.6 don't work in the new version, and there aren't any real examples I could find on the internet for 3.3. As such, I would like to ask for a good example of how format strings work. For instance, I've been trying to get this to work for my homework:
Day, date, year, hour, and minutes must be separate variables.
Using one formatted print statement, print the following:
Date:5/31/2013
Time: 3:45 pm
I can get it to work with this code:
def date():
    Month=5
    Day=31
    Year=2013
    Hours=3
    Minutes=45
    Scale='pm'
    a="Date: %i/%i/%i\nTime: %i:%i %s" %(Month,Day,Year,Hours,Minutes,Scale)
    print(a)
It works, but the output is not on one line as asked for. Please help; format is so confusing.
The \n in your format string is inserting the new line character. Remove the \n, and you will not have the newline any longer.
A backslash followed by a character is known as an escape sequence. Escape sequences can be used to insert special characters into strings. For example:
\n is the newline character,
\t is the tab character
Remove \n because that is used to create a line break.
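Since the question asks for a Python 3 example of format strings, here is a minimal sketch using str.format. The function name format_date_time and its separator parameter are hypothetical; pass a space as the separator if the output must stay on one line.

```python
def format_date_time(month, day, year, hours, minutes, scale, separator="\n"):
    """Build the date and time text with a single str.format call."""
    # {} takes the next argument as-is; {:02d} pads the minutes to two digits.
    template = "Date: {}/{}/{}{}Time: {}:{:02d} {}"
    return template.format(month, day, year, separator, hours, minutes, scale)


print(format_date_time(5, 31, 2013, 3, 45, "pm"))
# Date: 5/31/2013
# Time: 3:45 pm
print(format_date_time(5, 31, 2013, 3, 45, "pm", separator=" "))
# Date: 5/31/2013 Time: 3:45 pm
```

The {:02d} specifier matters for times such as 3:05, which the plain %i in the question would print as 3:5.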

Shell text to python string

I'm writing a little Python utility to help move our shell -help documentation to searchable web pages, but I hit a weird block:
output = subprocess.Popen([sys.argv[1], '--help'],stdout=subprocess.PIPE).communicate()[0]
output = output.split('\n')
print output[4]
#NAME
for l in output[4]:
    print l
#N
#A
#
#A
#M
#
#M
#E
#
#E
#or when written, n?na?am?me?e
It does this for any heading/subheading in the documentation, which makes it near unusable.
Any tips on getting correct formatting? Where did I screw up?
Thanks
The documentation contains overstruck characters done in the ancient line-printer way: print each character, followed by a backspace (\b aka \x08), followed by the same character again. So "NAME" becomes "N\bNA\bAM\bME\bE". If you can convince the program not to produce output that way, that would be best; otherwise, you can clean it up with something like output = re.sub(r'\x08.', '', output)
A common way to mark a character as bold in a terminal is to print the character, followed by a backspace character, followed by the character itself again (just as you would do on a mechanical typewriter). Terminal emulators like xterm detect such sequences and turn them into bold characters. Programs shouldn't print such sequences if stdout is not a terminal, but if your tool does, you will have to clean up the mess yourself.
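A minimal sketch of the cleanup, assuming the help text has already been read (and, on Python 3, decoded) into a str. The function name strip_overstrike is hypothetical. It uses the pattern .\x08 rather than \x08. so that the underline form "_\bX" is reduced to "X" as well; for the bold form both patterns give the same result.

```python
import re


def strip_overstrike(text):
    """Remove 'character + backspace' pairs used for bold and underline."""
    # Dropping the first copy of each overstruck pair keeps the visible
    # character: bold "N<BS>N" -> "N", underline "_<BS>N" -> "N".
    return re.sub(r".\x08", "", text)


print(strip_overstrike("N\bNA\bAM\bME\bE"))   # NAME
print(strip_overstrike("_\bN_\bA_\bM_\bE"))   # NAME
```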

How can I determine a Unicode character from its name in Python, even if that character is a control character?

I'd like to create an array of the Unicode code points which constitute white space in JavaScript (minus the Unicode-white-space code points, which I address separately). These characters are horizontal tab, vertical tab, form feed, space, non-breaking space, and BOM. I could do this with magic numbers:
whitespace = [0x9, 0xb, 0xc, 0x20, 0xa0, 0xfeff]
That's a little bit obscure; names would be better. The unicodedata.lookup method passed through ord helps some:
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160
But this doesn't work for 0x9, 0xb, or 0xc -- I think because they're control characters, and the "names" FORM FEED and such are just alias names. Is there any way to map these "names" to the characters, or their code points, in standard Python? Or am I out of luck?
Kerrek SB's comment is a good one: just put the names in a comment.
BTW, Python also supports a named unicode literal:
>>> u"\N{NO-BREAK SPACE}"
u'\xa0'
But it uses the same unicode name database, and the control characters are not in it.
You could roll your own "database" for the control characters by parsing a few lines of the UCD files in the Unicode public directory. In particular, see the UnicodeData-6.1.0d3 file (or see the parent directory for earlier versions).
I don't think it can be done in standard Python. The unicodedata module uses the UnicodeData.txt v5.2.0 Unicode database. Notice that the control characters are all assigned the name <control> (the second field, semicolon-delimited).
The script Tools/unicode/makeunicodedata.py in the Python source distribution is used to generate the table used by the Python runtime. The makeunicodename function looks like this:
def makeunicodename(unicode, trace):
    FILE = "Modules/unicodename_db.h"
    print "--- Preparing", FILE, "..."
    # collect names
    names = [None] * len(unicode.chars)
    for char in unicode.chars:
        record = unicode.table[char]
        if record:
            name = record[1].strip()
            if name and name[0] != "<":
                names[char] = name + chr(0)
    ...
Notice that it skips over entries whose name begins with "<". Hence, there is no name that can be passed to unicodedata.lookup that will give you back one of those control characters.
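The effect of that filter is visible from the other direction too: unicodedata.name has no name to return for control characters. A short sketch, written for Python 3, with "<no name>" as an arbitrary placeholder default:

```python
import unicodedata

for char in ("\t", "\x0b", "\x0c", "\xa0"):
    # name() accepts a default that is returned when the character has no name.
    name = unicodedata.name(char, "<no name>")
    print(hex(ord(char)), unicodedata.category(char), name)
```

The three control characters report category Cc and fall back to the placeholder, while U+00A0 reports category Zs and its name NO-BREAK SPACE.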
Just hardcode the code points for horizontal tab, line feed, and carriage return, and leave a descriptive comment. As the Zen of Python goes, "practicality beats purity".
A few points:
(1) "BOM" is not a character. BOM is a byte sequence that appears at the start of a file to indicate the byte order of a file that is encoded in UTF-nn. BOM is u'\uFEFF'.encode('UTF-nn'). Reading a file with the appropriate codec will slurp up the BOM; you don't see it as a Unicode character. A BOM is not data. If you do see u'\uFEFF' in your data, treat it as a (deprecated) ZERO-WIDTH NO-BREAK SPACE.
(2) "minus the Unicode-white-space code points, which I address separately"?? Isn't NO-BREAK SPACE a "Unicode-white-space" code point?
(3) Your Python appears to be broken; mine does this:
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160
(4) You could use escape sequences for the first three.
>>> map(hex, map(ord, "\t\v\f"))
['0x9', '0xb', '0xc']
(5) You could use " " for the fourth one.
(6) Even if you could use names, the readers of your code would still be applying blind faith that e.g. "FORM FEED" is a whitespace character.
(7) What happened to \r and \n?
Assuming you're working with Unicode strings, the first five items in your list, plus all other Unicode space characters, will be matched by the \s option when using a regular expression. Using Python 3.1.2:
>>> import re
>>> s = '\u0009,\u000b,\u000c,\u0020,\u00a0,\ufeff'
>>> s
'\t,\x0b,\x0c, ,\xa0,\ufeff'
>>> re.findall(r'\s', s)
['\t', '\x0b', '\x0c', ' ', '\xa0']
And as for the byte-order mark, the one given can be referred to as codecs.BOM_BE or codecs.BOM_UTF16_BE (though in Python 3+, it's returned as a bytes object rather than str).
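The BOM and \s behaviour described above can be checked with a short Python 3 sketch. The sample bytes b"text" and the variable names are arbitrary.

```python
import codecs
import re

# The UTF-16 big-endian BOM is U+FEFF encoded in that byte order.
print("\ufeff".encode("utf-16-be") == codecs.BOM_UTF16_BE)   # True

# The utf-8-sig codec strips a leading BOM; plain utf-8 keeps it as U+FEFF.
data = codecs.BOM_UTF8 + b"text"
print(repr(data.decode("utf-8-sig")))   # 'text'
print(repr(data.decode("utf-8")))       # '\ufefftext'

# \s on a str pattern matches the first five characters but not U+FEFF.
candidates = "\t\x0b\x0c \xa0\ufeff"
matched = [hex(ord(c)) for c in candidates if re.fullmatch(r"\s", c)]
print(matched)   # ['0x9', '0xb', '0xc', '0x20', '0xa0']
```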
The official Unicode recommendation for newlines may or may not be at odds with the way the Python codecs module handles newlines. Since u'\n' is often said to mean "new line", one might expect based on this recommendation for the Python string u'\n' to represent character U+2028 LINE SEPARATOR and to be encoded as such, rather than as the semantic-less control character U+000A. But I can only imagine the confusion that would result if the codecs module actually implemented that policy, and there are valid counter-arguments besides. Ditto for horizontal/vertical tab and form feed, which are probably not really characters but controls anyway. (I would certainly consider backspace to be a control, not a character.)
Your question seems to assume that treating U+000A as a control character (instead of a line separator) is wrong; but that is not at all certain. Perhaps it is more wrong for text processing applications everywhere to assume that a legacy printer-platen-scrolling control signal is really a true "line separator".
You can extend the lookup function to handle the characters that aren't included.
def unicode_lookup(x):
    try:
        ch = unicodedata.lookup(x)
    except KeyError:
        control_chars = {'LINE FEED':unichr(0x0a),'CARRIAGE RETURN':unichr(0x0d)}
        if x in control_chars:
            ch = control_chars[x]
        else:
            raise
    return ch
>>> unicode_lookup('SPACE')
u' '
>>> unicode_lookup('LINE FEED')
u'\n'
>>> unicode_lookup('FORM FEED')
Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    unicode_lookup('FORM FEED')
  File "<pyshell#13>", line 3, in unicode_lookup
    ch = unicodedata.lookup(x)
KeyError: "undefined character name 'FORM FEED'"
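The same idea can be sketched for Python 3 with a fallback table that also covers the tab and form feed characters from the question. The names CONTROL_CHARS and lookup_with_controls are hypothetical, and the table is hand-maintained. Some newer Python 3 releases may already resolve aliases such as FORM FEED in unicodedata.lookup (not verified here); the fallback gives the same result either way.

```python
import unicodedata

# Hand-maintained fallback names; extend as needed (an assumption of this sketch).
CONTROL_CHARS = {
    "CHARACTER TABULATION": "\t",
    "LINE FEED": "\n",
    "LINE TABULATION": "\x0b",
    "FORM FEED": "\x0c",
    "CARRIAGE RETURN": "\r",
}


def lookup_with_controls(name):
    """Resolve a character name, falling back to CONTROL_CHARS."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        if name in CONTROL_CHARS:
            return CONTROL_CHARS[name]
        raise


names = ("CHARACTER TABULATION", "LINE TABULATION", "FORM FEED",
         "SPACE", "NO-BREAK SPACE")
whitespace = [ord(lookup_with_controls(n)) for n in names]
print([hex(cp) for cp in whitespace])   # ['0x9', '0xb', '0xc', '0x20', '0xa0']
```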
