Why does my python regex not work? - python

I want to replace every character that is repeated more than once in a row. I used Python's re.sub, and my regex looks like this: data=re.sub('(.)\1+','##',data). But nothing happened...
Here is my Text:
※※※※※※※※※※※※※※※※※Chapter One※※※※※※※※※※※※※※※※※※
This is the begining...

You need to use a raw string here. In a regular string literal, \1 is interpreted as an octal escape, so the string ends up containing the character with code point 1 instead of a backslash followed by 1.
>>> '\1'
'\x01'
>>> chr(0o1)
'\x01'
>>> '\101'
'A'
>>> chr(0o101)
'A'
Use raw string to fix this:
>>> '(.)\1+'
'(.)\x01+'
>>> r'(.)\1+' #Note the `r`
'(.)\\1+'

Use a raw string, so the regex engine interprets backslashes instead of the Python parser. Just put an r in front of the string:
data=re.sub(r'(.)\1+', '##', data)
            ^ this r is the important bit
Otherwise, \1 is interpreted as character value 1 instead of a backreference.
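A minimal sketch of the difference, assuming a shortened version of the sample text (the variable names not_raw and raw are illustrative, not from the question):

```python
import re

data = "※※※Chapter One※※※"

# Without the r prefix, '\1' is the single character U+0001, so the
# pattern means "any character followed by one or more U+0001".
not_raw = re.sub('(.)\1+', '##', data)

# With the r prefix, the regex engine receives a backslash and a 1,
# which it reads as a backreference to group 1.
raw = re.sub(r'(.)\1+', '##', data)

print(not_raw)  # ※※※Chapter One※※※ (unchanged)
print(raw)      # ##Chapter One##
```

The non-raw pattern is still a valid regex, which is why no error is raised and the call silently does nothing.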

Related

Python regex to check if a string has particular set of numbers

I want to check whether this set of numbers appears in a string in exactly this pattern:
String I want to check: \4&2096297&0
My code
a = "SCSI\DISK&VEN_MICRON&PROD_1100\4&2096297&0&000200"
print(bool(re.match(r"\4&2096297&0+", a)))
It returns False instead of True. If I try the same thing with print(bool(re.match(r"hello[0-9]+", 'hello1'))), I get True. Where am I going wrong?
import re
pattern = "\4&2096297&0"
print(bool(re.search(pattern,a))) # this would print "True"
In a regular string literal, \4 is an octal escape for the character whose ordinal number is 4, not a literal backslash followed by a 4. You should use a raw string literal for the variable a instead:
a = r"SCSI\DISK&VEN_MICRON&PROD_1100\4&2096297&0&000200"
Also, instead of using r"\4&2096297&0+" as your regex, you should use double backslashes to denote a literal backslash so that \4 would not be interpreted as a backreference:
r"\\4&2096297&0+"
And finally, instead of re.match, you should use re.search since re.match matches the regex from the beginning of the string, which is not what you want.
So:
import re
a = r"SCSI\DISK&VEN_MICRON&PROD_1100\4&2096297&0&000200"
print(bool(re.search(r"\\4&2096297&0+", a)))
would output: True
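To see the three points side by side, here is a small sketch using the same string. The use of re.escape is an addition that the answer does not mention; it is one way to avoid counting backslashes by hand:

```python
import re

a = r"SCSI\DISK&VEN_MICRON&PROD_1100\4&2096297&0&000200"
pattern = r"\\4&2096297&0+"

# re.match only tries at position 0, where the text starts with "SCSI".
anchored = re.match(pattern, a)

# re.search scans the whole string for the first match.
found = re.search(pattern, a)

# re.escape builds a pattern that matches the given text literally.
escaped = re.search(re.escape(r"\4&2096297&0"), a)

print(anchored)         # None
print(found.group())    # \4&2096297&0
print(escaped.group())  # \4&2096297&0
```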

Why is a raw string performing inconsistently in parentheses

For example:
a = (r'''\n1''')
b = (r'''
2''')
print(a)
print(b)
The output of this example is this:
\n1

2
This means that even though b is supposed to be a raw string, it does not seem to behave like one. Why is this?
I also checked:
if '\n' in b:
    print('yes')
The output of this is yes, meaning that b does contain a newline character.
In the raw string syntax, escape sequences have no special meaning (apart from a backslash before a quote). The characters \ plus n form two characters in a raw string literal, unlike a regular string literal, where those two characters are replaced by a newline character.
An actual newline character, on the other hand, is not an escape sequence. It is just a newline character, and is included in the string as such.
Compare this to using 1 versus \x31; the latter is an escape sequence for the ASCII codepoint of the digit 1. In a regular string literal, both give you the character 1; in a raw string literal, the escape sequence is not interpreted:
>>> print('1\x31')
11
>>> print(r'1\x31')
1\x31
All this has nothing to do with parentheses. The parentheses do not alter the behaviour of a r'''...''' raw string. The exact same thing happens when you remove the parentheses:
>>> a = r'''\n1'''
>>> a
'\\n1'
>>> print(a)
\n1
>>> b = r'''
... 2'''
>>> b
'\n2'
>>> print(b)

2
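The same point can be checked by counting characters. This is only a sketch that reuses the a and b from the question:

```python
a = r'''\n1'''   # three characters: backslash, n, 1
b = r'''
2'''             # two characters: a real newline, 2

print(len(a), repr(a))  # 3 '\\n1'
print(len(b), repr(b))  # 2 '\n2'

# Parentheses around the literal only group the expression.
print((r'''\n1''') == a)  # True
```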

How to find string between '\begin{minipage}' and '\end{minipage}' by python re?

I have tried the following code:
strReFindString = u"\\begin{minipage}"+"(.*?)"
strReFindString += u"\\end{minipage}"
lst = re.findall(strReFindString, strBuffer, re.DOTALL)
But it always returns an empty list.
What am I doing wrong?
Thanks all.
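As @BrenBarn said, the string literal u"\\b" evaluates to the two characters \b, and the regex engine reads \b as a word-boundary assertion rather than a literal backslash followed by b, so the pattern never matches the text \begin. u"\\\\b" evaluates to \\b, which the regex engine understands as a literal backslash followed by a literal b. You can skip the string-level escape parsing by using raw strings: ur"\\b" is equal to u"\\\\b" (the ur prefix exists only in Python 2; in Python 3, write r"\\b"):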
ur"\\b" == u"\\\\b"
# => True
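For Python 3, a minimal sketch of the whole search might look like this. The sample input is hypothetical because the question does not show strBuffer, and re.escape is used here as an alternative to writing the doubled backslashes by hand:

```python
import re

# Hypothetical sample input; the question does not show strBuffer.
str_buffer = r"""
\begin{minipage}
first block
\end{minipage}
text in between
\begin{minipage}second block\end{minipage}
"""

# re.escape turns the literal delimiters into patterns that match them
# exactly, including the backslash and the braces.
begin = re.escape(r"\begin{minipage}")
end = re.escape(r"\end{minipage}")

# (.*?) is non-greedy; re.DOTALL lets . match newlines as well.
blocks = re.findall(begin + r"(.*?)" + end, str_buffer, re.DOTALL)

print(blocks)  # ['\nfirst block\n', 'second block']
```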

Replace String in python with matched pattern

I have to remove any punctuation marks from the start and at the end of the word.
I am using re.sub to do it.
re.sub(r'(\w.+)(?=[^\w]$)','\1',text)
The grouping is not working: all I get is ☺. for the input Mihir4. on the command line.
If you have a string with multiple words, such as
text = ".adfdf. 'df' !3423? ld! :sdsd"
this will do the trick (it will also work for single words, of course):
>>> re.sub(r'[^\w\s]*(\w+)[^\w\s]*', r'\1', text)
'adfdf df 3423 ld sdsd'
Notice the r in r'\1'. This is equivalent to '\\1'.
>>> re.sub(r'[^\w\s]*(\w+)[^\w\s]*', '\\1', text)
'adfdf df 3423 ld sdsd'
Further reading: the backslash plague
The string literal '\1' is equivalent to '\x01'. You need to escape the backslash or use a raw string literal for it to mean a backreference to group 1.
BTW, you don't need to use a capturing group at all.
>>> re.sub(r'^[^-\w]+|[^-\w]+$', '', 'Mihir4.')
'Mihir4'

Python - why doesn't this simple regex work?

The code below should be self-explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile('\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print result
None
Alternatively, this works:
>>> char_regex = re.compile('\w')
>>> result = char_regex.match(string)
>>> print result
<_sre.SRE_Match object at 0x10044e780>
Why does the second regex work, but not the first?
Here is what the documentation for re.match() says: "If zero or more characters at the beginning of string match the regular expression pattern ..."
In your case, the string does not start with a digit, so \d cannot match at the beginning. It does start with t, which \w matches.
If you want to check for a digit anywhere in your string with the same mechanism, prepend .* to your regex:
digit_regex = re.compile(r'.*\d')
The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.
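A short sketch comparing the three methods on the string from the question (the result variable names are illustrative):

```python
import re

string = 'this is a string with a 4 digit in it'
digit_regex = re.compile(r'\d')

at_start = digit_regex.match(string)    # only tries position 0
anywhere = digit_regex.search(string)   # first match anywhere
all_hits = digit_regex.findall(string)  # every non-overlapping match

print(at_start)          # None
print(anywhere.group())  # 4
print(anywhere.start())  # 24
print(all_hits)          # ['4']
```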
