regex match&replace for raw string - python

I don't understand the following behavior of re.sub. Can anybody explain how the strings are processed in re.sub? Why doesn't statement 2 match and replace? Thanks.
1. >>> re.sub(r'\$abc', 'ABC', r'\$abcdefg')
'\\ABCdefg'
2. >>> re.sub(r'\\$abc', 'ABC', r'\\$abcdefg')
'\\\\$abcdefg'
3. >>> r'\\$abc' in r'\\$abcdefg'
True
4. >>> re.sub(r'\\\$abc', 'ABC', r'\\\$abcdefg')
'\\\\ABCdefg'

It is because in the pattern (first argument) the double backslash \\ turns into only one literal backslash, and the $ that follows is left unescaped, so it acts as the end-of-string anchor and the pattern can't match.
This happens because when the regex engine sees a backslash \ it escapes the next character, so \\ stands for just one literal \.
To match two backslashes followed by a literal $, add two more backslashes and escape the $:
2. >>> re.sub(r'\\\\\$abc', 'ABC', r'\\\$abcdefg')
'\\ABCdefg'
Reference: re.sub
Hope it helps.
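As a minimal sketch of the explanation above (the variable names are only for illustration), this compares the pattern from statement 2 with a fully escaped one, and uses re.escape to build the escaped pattern instead of counting backslashes by hand:

```python
import re

text = r'\\$abcdefg'          # two backslashes, then $abcdefg

# r'\\$abc' means: one literal backslash, then the end-of-string anchor $,
# then "abc". Nothing can follow the end of the string, so it never matches.
unescaped = re.sub(r'\\$abc', 'ABC', text)

# Escaping the $ as well makes it literal; each literal backslash needs \\.
escaped = re.sub(r'\\\\\$abc', 'ABC', text)

# re.escape builds the escaped pattern from the literal text.
auto = re.sub(re.escape(r'\\$abc'), 'ABC', text)

print(unescaped == text)  # True: nothing was replaced
print(escaped)            # ABCdefg
print(auto)               # ABCdefg
```

re.escape is usually the safer choice when the text to find is a fixed string rather than a real pattern.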

Related

Python regex to check if a string has particular set of numbers

I want to check whether this set of numbers appears in a string in this exact pattern:
String I want to check: \4&2096297&0
My code
a = "SCSI\DISK&VEN_MICRON&PROD_1100\4&2096297&0&000200"
print(bool(re.match(r"\4&2096297&0+", a)))
It returns False instead of True. If I try the same thing with print(bool(re.match(r"hello[0-9]+", 'hello1'))), I get True. Where am I going wrong?

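import re
# Works only because neither string is raw: "\4" is the character \x04 in both.
a = "SCSI\DISK&VEN_MICRON&PROD_1100\4&2096297&0&000200"
pattern = "\4&2096297&0"
print(bool(re.search(pattern, a)))  # prints True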

\4 refers to a character whose ordinal number is octal 4 rather than a literal backslash followed by a 4. You should use a raw string literal for the variable a instead:
a = r"SCSI\DISK&VEN_MICRON&PROD_1100\4&2096297&0&000200"
Also, instead of using r"\4&2096297&0+" as your regex, you should use double backslashes to denote a literal backslash so that \4 would not be interpreted as a backreference:
r"\\4&2096297&0+"
And finally, instead of re.match, you should use re.search since re.match matches the regex from the beginning of the string, which is not what you want.
So:
import re
a = r"SCSI\DISK&VEN_MICRON&PROD_1100\4&2096297&0&000200"
print(bool(re.search(r"\\4&2096297&0+", a)))
would output: True
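If the text to look for is a fixed string, a possible alternative is to let re.escape do the escaping, or to skip regex altogether. This is only a sketch; the name needle is made up for the example:

```python
import re

# Raw string: \D and \4 stay as backslash + character.
a = r"SCSI\DISK&VEN_MICRON&PROD_1100\4&2096297&0&000200"
needle = r"\4&2096297&0"      # the literal text to look for

# re.escape turns the backslash into \\ so \4 is not read as a backreference.
found = re.search(re.escape(needle), a) is not None

# For a fixed substring with no pattern syntax, "in" is enough.
contained = needle in a

print(found, contained)  # True True
```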

Use python 3 regex to match a string in double quotes

I want to match a string contained in a pair of either single or double quotes. I wrote a regex pattern as so:
pattern = r"([\"\'])[^\1]*\1"
mytext = '"bbb"ccc"ddd'
re.match(pattern, mytext).group()
The expected output would be:
"bbb"
However, this is the output:
"bbb"ccc"
Can someone explain what's wrong with the pattern above? I googled and found the correct pattern to be:
pattern = r"([\"\'])[^\1]*?\1"
However, I don't understand why I must use ?.

In your regex
([\"'])[^\1]*\1
A character class matches exactly one character, and inside a character class \1 is not a backreference (Python reads it as the octal escape for the character \x01). So your use of [^\1] is incorrect. Think about what would happen if the first capturing group contained more than one character.
You can use negative lookahead like this
(["'])((?!\1).)*\1
or simply with alternation
(["'])(?:[^"'\\]+|\\.)*\1
or
(?<!\\)(["'])(?:[^"'\\]+|\\.)*\1
if you want to make sure "b\"ccc" does not match in the string bb\"b\"ccc"
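A short sketch of the difference, using the question's own input: the original pattern over-matches because [^\1] excludes only the character \x01, while the lookahead version stops at the first matching quote.

```python
import re

mytext = '"bbb"ccc"ddd'

# Inside [...], \1 is an octal escape for chr(1), not a backreference,
# so [^\1] accepts every character except '\x01', quotes included.
greedy = re.match(r"([\"'])[^\1]*\1", mytext).group()

# The lookahead (?!\1) is outside any character class, so \1 really is
# the captured quote: consume one character at a time unless it is that quote.
fixed = re.match(r"([\"'])(?:(?!\1).)*\1", mytext).group()

print(greedy)  # "bbb"ccc"
print(fixed)   # "bbb"
```

This also explains why adding ? appears to fix the original pattern: the lazy quantifier stops at the first closing quote, but the character class still is not doing what it looks like it does.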

You should use a negative lookahead assertion. And I assume there won't be any escaped quotes in your input string.
>>> pattern = r"([\"'])(?:(?!\1).)*\1"
>>> mytext = '"bbb"ccc"ddd'
>>> re.search(pattern, mytext).group()
'"bbb"'

You can use:
pattern = r"[\"'][^\"']*[\"']"
https://regex101.com/r/dO0cA8/1
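[^\"']* will match any run of characters that are neither " nor '. Note that this pattern does not require the closing quote to be the same kind as the opening one.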

Add [] around numbers in strings

I'd like to add [] around any sequence of digits in a string, e.g.
"pixel1blue pin10off output2high foo9182bar"
should convert to
"pixel[1]blue pin[10]off output[2]high foo[9182]bar"
I feel there must be a simple way, but it's eluding me :(

Yes, there is a simple way, using re.sub():
result = re.sub(r'(\d+)', r'[\1]', inputstring)
Here \d matches a digit, \d+ matches 1 or more digits. The (...) around that pattern groups the match so we can refer to it in the second argument, the replacement pattern. That pattern simply replaces the matched digits with [...] around the group.
Note that I used r'..' raw string literals; if you don't you'd have to double all the \ backslashes; see the Backslash Plague section of the Python Regex HOWTO.
Demo:
>>> import re
>>> inputstring = "pixel1blue pin10off output2high foo9182bar"
>>> re.sub(r'(\d+)', r'[\1]', inputstring)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
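Two variations that may be useful, shown as a sketch on the same input: \g<0> refers to the whole match, so the capturing group can be dropped, and a function can be passed as the replacement when the new text needs to be computed.

```python
import re

inputstring = "pixel1blue pin10off output2high foo9182bar"

# \g<0> refers to the whole match, so no capturing group is needed.
with_g0 = re.sub(r'\d+', r'[\g<0>]', inputstring)

# A function receives each match object and returns the replacement text.
with_func = re.sub(r'\d+', lambda m: '[' + m.group() + ']', inputstring)

print(with_g0)    # pixel[1]blue pin[10]off output[2]high foo[9182]bar
print(with_func)  # pixel[1]blue pin[10]off output[2]high foo[9182]bar
```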

You can use re.sub:
>>> s="pixel1blue pin10off output2high foo9182bar"
>>> import re
>>> re.sub(r'(\d+)',r'[\1]',s)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
Here (\d+) matches any run of digits, and re.sub replaces it with the first group's match wrapped in brackets, r'[\1]'.
You can start learning regular expressions here: http://www.regular-expressions.info/

regex issue with numbers (python)

I have a string that looks like this:
<123, 321>
Each number can range from 0 to 999.
I need to get those coordinates into two variables as fast as possible, so I thought about regex. I've already split the string into two parts, and now I need to isolate the integer from all the other characters.
I've tried this pattern:
^-?[0-9]+$
but the output is:
[]
any help? :)

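If your strings always follow the same format, <123, 321>, then this should be a little faster than the regex approach:
def str_translate(s):
    # Python 3: the third argument of str.maketrans lists characters to delete.
    return s.translate(str.maketrans("", "", " <>")).split(',')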
In [52]: str_translate("<123, 321>")
Out[52]: ['123', '321']

All you need to do is get rid of the anchors (^ and $):
>>> import re
>>> string = "<123, 321>"
>>> re.findall(r"-?[0-9]+", string)
['123', '321']
Note: the anchors ^ and $ at the start and end of the pattern ^-?[0-9]+$ ensure that the whole string consists only of an optional minus sign followed by digits.
That is, the regex engine attempts to match -?[0-9]+ from the start of the string (^) all the way to the end of the string ($). It fails because < cannot be matched by -?[0-9]+.
Without the anchors, re.findall finds all substrings that match -?[0-9]+, that is, the numbers.

"^-?[0-9]+$" will only match a string that contains a number and nothing else.
You want to match a group and then extract that group:
>>> pattern = re.compile("(-?[0-9]+)")
>>> pattern.findall("<123, 321>")
['123', '321']
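To get the two coordinates straight into integer variables, one option is to match the whole shape and convert the groups. This is a sketch under the assumption that the input always looks like <x, y>; parse_point is a hypothetical helper name, not something from the question:

```python
import re

def parse_point(s):
    """Return (x, y) as ints from a string such as '<123, 321>'."""
    # fullmatch checks the whole shape; the groups pick out the numbers.
    m = re.fullmatch(r"<\s*(-?[0-9]+)\s*,\s*(-?[0-9]+)\s*>", s)
    if m is None:
        raise ValueError("not a coordinate pair: %r" % (s,))
    return int(m.group(1)), int(m.group(2))

x, y = parse_point("<123, 321>")
print(x, y)  # 123 321
```

Matching the full string here plays the role the ^ and $ anchors were meant to play, but applied to the complete format rather than to the digits alone.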

Why does my python regex not work?

I want to replace all the characters that occur more than once in a row. I used Python's re.sub with this regex: data=re.sub('(.)\1+','##',data), but nothing happened.
Here is my Text:
※※※※※※※※※※※※※※※※※Chapter One※※※※※※※※※※※※※※※※※※
This is the begining...

You need to use a raw string here. In a normal string literal, \1 is interpreted as an octal escape, so the string contains the character whose code point is that octal number.
>>> '\1'
'\x01'
>>> chr(0o1)
'\x01'
>>> '\101'
'A'
>>> chr(0o101)
'A'
Use raw string to fix this:
>>> '(.)\1+'
'(.)\x01+'
>>> r'(.)\1+' #Note the `r`
'(.)\\1+'

Use a raw string, so the regex engine interprets backslashes instead of the Python parser. Just put an r in front of the string:
data=re.sub(r'(.)\1+', '##', data)
            ^ this r is the important bit
Otherwise, \1 is interpreted as character value 1 instead of a backreference.
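A quick side-by-side check, as a sketch on a shortened version of the question's text (the variable names are just for illustration):

```python
import re

data = "※※※Chapter One※※※"

# Without r, '\1' is the character '\x01', so the pattern looks for
# any character followed by one or more '\x01' and finds nothing here.
unchanged = re.sub('(.)\1+', '##', data)

# With r, the regex engine receives the backreference \1.
collapsed = re.sub(r'(.)\1+', '##', data)

print(unchanged == data)  # True
print(collapsed)          # ##Chapter One##
```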
