Number of characters in a string (escape sequences) - python

I wrote this piece of code to count the number of characters in the string a\nb\x1f\000d
#CLASS TASK-VI
ctr=0
str1="a\nb\x1f\000d"
for i in range(len(str1)):
    ctr=ctr+1
print("Number of characters in the string str1 is: ",ctr)
It returns Number of characters in the string str1 is: 6
Can someone explain this? Thanks in advance.

There are 6 characters in the string:
a
\n, which resolves to the single 'newline' character (also referred to as 'line feed')
b
\x1f, which is a hex escape sequence. \x means that the following two characters are read as a hexadecimal number (in this case, 1f -> 31), and the result is the character whose code is that number. Character number 31 happens to be an ASCII control character, known as 'unit separator'.
\000 is an octal escape sequence, which is the same as above but in base 8. In this case, the code it refers to is 0, which is the null character
d
Backslashes are a special control character that 'escape' the following character. Certain 'escape sequences' have special effects - here you see \n, \x, and \0, for example, though there are plenty more if you feel like looking them up. In python, you can make a string not process escape sequences by declaring it as a "raw string", which you do by starting the string with r" instead of just ":
>>> len("a\nb\x1f\000d")
6
>>> len(r"a\nb\x1f\000d")
13
You can also use a double-backslash \\ to escape a backslash, thus preventing it from escaping something else.
>>> len("a\\nb\\x1f\\000d")
13
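A quick way to check this breakdown is to list the code point of each character. This is only a minimal sketch; the variable name code_points is illustrative and not part of the question.

```python
# Each escape sequence in the literal collapses to a single character.
str1 = "a\nb\x1f\000d"

code_points = [ord(ch) for ch in str1]
print(code_points)  # [97, 10, 98, 31, 0, 100]
print(len(str1))    # 6

# repr() renders the invisible characters as escapes again.
print(repr(str1))   # 'a\nb\x1f\x00d'
```

The six code points line up with a, newline, b, unit separator, null and d.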

If you want the backslashes to be counted as characters, double them; otherwise each backslash starts an escape sequence.
>>> str1 = "a\\nb\\x1f\\000d"
>>> len(str1)
13

str1= "a\nb\x1f\000d"
print(f"Number of characters in the string str1 is {len(str1)}")
# 6
str1= r"a\nb\x1f\000d"
print(f"Number of characters in the string str1 is {len(str1)}")
#13

If the string comes from outside Python source code, no escape processing happens: every character, backslashes included, is taken literally.
For example, create a file named "content.txt", type or paste any string with special characters into it, and read it with Python. The content arrives exactly as it is written in the file.
That is:
with open("content.txt") as fl:
    content = fl.read()
print(len(content))
print(content)
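To try this without creating a file by hand, the sketch below writes the text to a temporary file first and reads it back. The temporary directory and the name text_as_typed are assumptions made for the demonstration.

```python
import os
import tempfile

# The 13 characters as they would be typed into the file, backslashes included.
text_as_typed = r"a\nb\x1f\000d"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "content.txt")
    with open(path, "w", encoding="utf-8") as fl:
        fl.write(text_as_typed)
    # Reading does not interpret backslashes; that only happens in source literals.
    with open(path, encoding="utf-8") as fl:
        content = fl.read()

print(len(content))  # 13
print(content)       # a\nb\x1f\000d
```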

Related

how to fix the error of converting numbers to letters in text files using python

I have code where the number I enter for codeheroback ends up converted to a letter in the output.
How can I fix it?
Here is my code:
import os,re
codeheroback = str(input('Enter hero code:'))
checkskinid = '''<int name="heroId" value="" refParamName="" useRefParam="false"/>'''
with open('back.xml', 'r+' ,encoding = 'utf-8') as fileback:
    regexskinid = re.sub('name="heroId" value="(.*?)" refParamName' , 'name="heroId" value="\\1{}" refParamName'.format(str((codeheroback))) , checkskinid)
    fileback.write(regexskinid)
Here is the returned result:
<int name="heroId" value="h3" refParamName="" useRefParam="false"/>
What I want it to return:
<int name="heroId" value="503" refParamName="" useRefParam="false"/>
Your problem is that you formatted in a number after a \1 in the literal replacement string. re's substitution syntax unfortunately can't tell the difference between a \1 followed by a number and \123. Per the docs, emphasis added:
\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
After formatting and string literal escapes are resolved, the replacement string contains a literal \1503. re reads \150 as the octal escape for an ASCII h, which leaves h3. You can remove the ambiguity by using the explicitly delimited form of the group backreference: replace \\1 with \\g<1>, so trailing digits aren't treated as part of the escape.
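The difference can be reproduced without the file handling. This is a minimal sketch that assumes the user typed 503; the names template, hero_code, ambiguous and explicit are illustrative.

```python
import re

template = '<int name="heroId" value="" refParamName="" useRefParam="false"/>'
pattern = r'name="heroId" value="(.*?)" refParamName'
hero_code = "503"  # assumed user input

# "\\1" followed by "503" becomes \1503, which re parses as octal \150 ("h") plus "3".
ambiguous = re.sub(
    pattern, 'name="heroId" value="\\1{}" refParamName'.format(hero_code), template
)

# "\\g<1>" ends the group reference explicitly, so the digits stay literal text.
explicit = re.sub(
    pattern, 'name="heroId" value="\\g<1>{}" refParamName'.format(hero_code), template
)

print(ambiguous)  # <int name="heroId" value="h3" refParamName="" useRefParam="false"/>
print(explicit)   # <int name="heroId" value="503" refParamName="" useRefParam="false"/>
```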

Python: re.sub return illegal characters when the source containing Chinese character [duplicate]

I want to take the string 0.71331, 52.25378 and return 0.71331,52.25378 - i.e. just look for a digit, a comma, a space and a digit, and strip out the space.
This is my current code:
import re
coords = '0.71331, 52.25378'
coord_re = re.sub("(\d), (\d)", "\1,\2", coords)
print(coord_re)
But this gives me 0.7133,2.25378. What am I doing wrong?
You should be using raw strings for regex, try the following:
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
With your current code, the backslashes in your replacement string are escaping the digits, so you are replacing every match with the equivalent of chr(1) + "," + chr(2):
>>> '\1,\2'
'\x01,\x02'
>>> print('\1,\2')
,
>>> print(r'\1,\2') # this is what you actually want
\1,\2
Any time you want to leave the backslash in the string, use the r prefix, or escape each backslash (\\1,\\2).
Python interprets the \1 as a character with ASCII value 1, and passes that to sub.
Use raw strings, in which Python doesn't interpret the \.
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
This is covered right in the beginning of the re documentation, should you need more info.
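Both behaviours can be seen side by side in Python 3. In this sketch the names fixed and broken are illustrative; the pattern is raw in both calls so that only the replacement string differs.

```python
import re

coords = "0.71331, 52.25378"

# Raw replacement: re receives a backslash and a digit, i.e. a group reference.
fixed = re.sub(r"(\d), (\d)", r"\1,\2", coords)

# Non-raw replacement: Python has already turned \1 and \2 into chr(1) and chr(2).
broken = re.sub(r"(\d), (\d)", "\1,\2", coords)

print(fixed)         # 0.71331,52.25378
print(repr(broken))  # '0.7133\x01,\x022.25378'
```

Printing broken without repr() looks like 0.7133,2.25378 because the two control characters are invisible in most terminals.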

Match characters and whitespaces, but not numbers

I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.
Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
^(?:(?!\d)[\w ])+$
Note: This regex uses the mu flags for multiline and Unicode (multiline only necessary if input is separated by newline characters)
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
(?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
[\w ] Match any word character or space (matches Unicode word characters with u flag)
$ Assert position at the end of the line
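The same pattern can be tried in Python, where \w is already Unicode-aware for str patterns in Python 3, so no extra flag is needed when each input is tested on its own. A minimal sketch; the sample list is assumed for illustration.

```python
import re

# Each position must not be a digit, and must be a word character or a space.
pattern = re.compile(r"^(?:(?!\d)[\w ])+$")

samples = ["ÀÇÆ some words", "ÀÇÆ some words 123", "hell o", "hello 123"]
matches = [s for s in samples if pattern.match(s)]
print(matches)  # ['ÀÇÆ some words', 'hell o']
```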
You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']

Why is raw string performing inconsistently in parenthesis

For example:
a = (r'''\n1''')
b = (r'''
2''')
print(a)
print(b)
The output of this example is this:
\n1

2
Meaning that even if b is supposed to be a raw string, it does not seem to work like one, why is this?
I also checked:
if '\n' in b:
    print('yes')
The output of this is yes meaning that b is a string, and indeed has \n string inside of it.
In the raw string syntax, escape sequences have no special meaning (apart from a backslash before a quote). The characters \ plus n form two characters in a raw string literal, unlike a regular string literal, where those two characters are replaced by a newline character.
An actual newline character, on the other hand, is not an escape sequence. It is just a newline character, and is included in the string as such.
Compare this to using 1 versus \x31; the latter is an escape sequence for the ASCII codepoint for the digit 1. In a regular string literal, both would give you the character 1, in a raw string literal, the escape sequence would not be interpreted:
>>> print('1\x31')
11
>>> print(r'1\x31')
1\x31
All this has nothing to do with parentheses. The parentheses do not alter the behaviour of a r'''...''' raw string. The exact same thing happens when you remove the parentheses:
>>> a = r'''\n1'''
>>> a
'\\n1'
>>> print(a)
\n1
>>> b = r'''
... 2'''
>>> b
'\n2'
>>> print(b)

2
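Counting characters makes the distinction concrete. A small sketch using the same two literals as the question:

```python
a = r'''\n1'''   # backslash, n, 1: the escape is left uninterpreted
b = r'''
2'''             # a real newline typed in the source, then 2

print(len(a))        # 3
print(len(b))        # 2
print(b == '\n2')    # True: the newline is an ordinary character, raw or not
print(a == '\\n1')   # True
```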

Python regex - r prefix

Can anyone explain why example 1 below works, when the r prefix is not used?
I thought the r prefix must be used whenever escape sequences are used.
Example 2 and example 3 demonstrate this.
# example 1
import re
print (re.sub('\s+', ' ', 'hello there there'))
# prints 'hello there there' - not expected as r prefix is not used
# example 2
import re
print (re.sub(r'(\b\w+)(\s+\1\b)+', r'\1', 'hello there there'))
# prints 'hello there' - as expected as r prefix is used
# example 3
import re
print (re.sub('(\b\w+)(\s+\1\b)+', '\1', 'hello there there'))
# prints 'hello there there' - as expected as r prefix is not used
Because a \ begins an escape sequence only when what follows forms a valid escape sequence.
>>> '\n'
'\n'
>>> r'\n'
'\\n'
>>> print('\n')
>>> print(r'\n')
\n
>>> '\s'
'\\s'
>>> r'\s'
'\\s'
>>> print('\s')
\s
>>> print(r'\s')
\s
Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:
Escape Sequence   Meaning
\newline          Ignored
\\                Backslash (\)
\'                Single quote (')
\"                Double quote (")
\a                ASCII Bell (BEL)
\b                ASCII Backspace (BS)
\f                ASCII Formfeed (FF)
\n                ASCII Linefeed (LF)
\N{name}          Character named name in the Unicode database (Unicode only)
\r                ASCII Carriage Return (CR)
\t                ASCII Horizontal Tab (TAB)
\uxxxx            Character with 16-bit hex value xxxx (Unicode only)
\Uxxxxxxxx        Character with 32-bit hex value xxxxxxxx (Unicode only)
\v                ASCII Vertical Tab (VT)
\ooo              Character with octal value ooo
\xhh              Character with hex value hh
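A few of these escapes can be checked directly. This sketch assumes Python 3, where every str literal is Unicode, so the entries marked "Unicode only" apply to ordinary strings; the dictionary name checks is illustrative.

```python
# Map a description to (escaped literal, expected character).
checks = {
    "tab": ("\t", chr(9)),
    "hex 41": ("\x41", "A"),
    "octal 101": ("\101", "A"),
    "16-bit hex": ("\u00e9", "é"),
    "named": ("\N{LATIN SMALL LETTER E WITH ACUTE}", "é"),
    "backslash": ("\\", chr(92)),
}

for label, (literal, expected) in checks.items():
    print(label, literal == expected)  # each line ends in True

# Seven escapes, seven characters.
print(len("\a\b\f\n\r\t\v"))  # 7
```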
Never rely on raw strings for path literals, as raw strings have some rather peculiar inner workings, known to have bitten people in the ass:
When an "r" or "R" prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string. For example, the string literal r"\n" consists of two characters: a backslash and a lowercase "n". String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.
To better illustrate this last point:
>>> r'\'
SyntaxError: EOL while scanning string literal
>>> r'\''
"\\'"
>>> '\'
SyntaxError: EOL while scanning string literal
>>> '\''
"'"
>>>
>>> r'\\'
'\\\\'
>>> '\\'
'\\'
>>> print(r'\\')
\\
>>> print(r'\')
SyntaxError: EOL while scanning string literal
>>> print('\\')
\
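The trailing-backslash rule can also be checked from a script rather than the interactive prompt. The helper is_valid_literal and the Windows-style path are hypothetical, introduced only for this sketch; concatenating an escaped backslash is one possible workaround, not the only one.

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False

print(is_valid_literal("r'\\'"))    # False: the source text is r'\'
print(is_valid_literal("r'\\\\'"))  # True: the source text is r'\\'

# A raw string cannot end in a single backslash, so append one separately.
folder = r"C:\temp" + "\\"  # hypothetical path
print(folder)               # C:\temp\
```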
the 'r' means that the following is a "raw string", i.e. backslash characters are treated literally instead of signifying special treatment of the following character.
http://docs.python.org/reference/lexical_analysis.html#literals
so '\n' is a single newline
and r'\n' is two characters - a backslash and the letter 'n'
another way to write it would be '\\n' because the first backslash escapes the second
an equivalent way of writing this
print (re.sub(r'(\b\w+)(\s+\1\b)+', r'\1', 'hello there there'))
is
print (re.sub('(\\b\\w+)(\\s+\\1\\b)+', '\\1', 'hello there there'))
Because of the way Python treats characters that are not valid escape characters, not all of those double backslashes are necessary - eg '\s'=='\\s' however the same is not true for '\b' and '\\b'. My preference is to be explicit and double all the backslashes.
Not all sequences involving backslashes are escape sequences. \t and \f are, for example, but \s is not. In a non-raw string literal, any \ that is not part of an escape sequence is seen as just another \:
>>> "\s"
'\\s'
>>> "\t"
'\t'
\b is an escape sequence, however, so example 3 fails. (And yes, some people consider this behaviour rather unfortunate.)
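The \b case can be verified in Python 3. In this sketch the non-raw pattern is written with doubled backslashes, and the names raw_pattern and doubled_pattern are illustrative.

```python
import re

# In a normal literal, \b is the backspace character, not a regex word boundary.
print('\b' == '\x08')  # True
print(len('\b'))       # 1
print(len(r'\b'))      # 2: a backslash and the letter b

raw_pattern = r'(\b\w+)(\s+\1\b)+'
doubled_pattern = '(\\b\\w+)(\\s+\\1\\b)+'
print(raw_pattern == doubled_pattern)  # True

result = re.sub(raw_pattern, r'\1', 'hello there there')
print(result)  # hello there
```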
Try this:
a = '\''     # print(a) shows '
a = r'\''    # print(a) shows \'
a = "\'"     # print(a) shows '
a = r"\'"    # print(a) shows \'
Check the example below:
print(r"123\n123")
#outputs>>>
123\n123
print("123\n123")
#outputs>>>
123
123
