Regex: parsing differently if character is escaped - python

Given this string "foo-bar=369,337,234,123", I'm able to parse it to ['foo-bar', '369', '337', '234', '123'] with this regular expression:
re.findall(r'[a-zA-Z0-9\-_\+;]+', 'foo-bar=369,337,234,123')
Now, if I escape some of the commas in the string, e.g. "foo-bar=369\,337\,234,123", I would like it to be parsed a bit differently: ['foo-bar', '369\,337\,234', '123']. I tried the regex below, but it doesn't work:
r'[a-zA-Z0-9\-_\+;(\\,)]+'
I was basically trying to add the character sequence \, to the list of characters to match.

You may use
[a-zA-Z0-9_+;-]+(?:\\,[a-zA-Z0-9_+;-]+)*
If you pass re.A or re.ASCII to re.compile, you may shorten it to
[\w+;-]+(?:\\,[\w+;-]+)*
Regex details
[\w+;-]+ - one or more word, +, ; or - chars
(?:\\,[\w+;-]+)* - 0 or more occurrences of a \, substring followed with 1+ word, +, ; or - chars.
Python demo:
import re
strings = [r'foo-bar=369,337,234,123', r'foo-bar=369\,337\,234,123']
rx = re.compile(r"[\w+;-]+(?:\\,[\w+;-]+)*", re.A)
for s in strings:
    print(f"Parsing {s}")
    print(rx.findall(s))
Output:
Parsing foo-bar=369,337,234,123
['foo-bar', '369', '337', '234', '123']
Parsing foo-bar=369\,337\,234,123
['foo-bar', '369\\,337\\,234', '123']
Note that the double backslashes in the printed list are just how Python displays a string literal: each pair denotes a single literal backslash.
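To see that the doubled backslash is only a display artifact, here is a small sketch that reuses the pattern from the demo above (the names parts and escaped are just for illustration) and compares the repr with the printed text:

```python
import re

rx = re.compile(r"[\w+;-]+(?:\\,[\w+;-]+)*", re.A)
parts = rx.findall(r"foo-bar=369\,337\,234,123")

escaped = parts[1]
print(repr(escaped))  # repr doubles each backslash for display
print(escaped)        # the text itself holds one backslash per escaped comma
print(len(escaped))   # counts one character per backslash
```

The length counts each backslash once, which confirms the matched text is unchanged from the input.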

Related

How to remove a digit pattern at the beginning of a string with regex

I have a string like so, "123.234.567 Remove numbers in this string". The output should be "Remove numbers in this string".
The digits in this string follow the pattern xx.xxx.xxxxx... (digits followed by a period), but the number of periods and the number of digits between periods are not fixed. Here are a few examples: xx.xxxxxx.xxxx.xxxxxxxx, x.xx.xxxx.xxxxxxxx.xx.xxxxx, x.xx.xxxxxx, etc.
How can I remove these digits followed by periods in regex?
So far I have something like this:
patt = re.compile('(\s*)[0-9].[0-9]*.[0-9]*(\s*)')
But this only works for a specific format.
Use ^ to match the beginning of the string.
Use \d+ to match any number of digits.
Use \. to match a literal . character
Put \.\d+ in a group with () so you can quantify it to match any number of them.
Use re.sub() to replace it with an empty string to remove the match.
Use a raw string so you can put literal backslashes in the regexp without having to escape them.
patt = re.compile(r'^\s*\d+(?:\.\d+)+\s*')
string = patt.sub('', string)
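As a quick check of the steps above, here is a minimal sketch run against a few sample strings. It assumes the optional whitespace belongs after the ^ anchor and uses the compiled pattern's sub() method for the replacement; every sample except the first is made up for illustration.

```python
import re

# ^ anchors at the start, \d+ matches the first run of digits,
# (?:\.\d+)+ matches one or more ".digits" groups.
patt = re.compile(r'^\s*\d+(?:\.\d+)+\s*')

samples = [
    "123.234.567 Remove numbers in this string",
    "1.22.3333.44444444 Another title",  # hypothetical input
    "No leading digits here 1.2",         # hypothetical input, left unchanged
]

cleaned = [patt.sub('', s) for s in samples]
for line in cleaned:
    print(line)
```

Because of the ^ anchor, dotted digits that appear later in the string are not touched.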

Replacing a special identifier pattern with re.sub in python

I am new to regex and have a regex replacement in a re.sub that I can't figure out.
import re
test_cases = [
"1-Some String #0123",
"2-Some String #1234-56-a",
"3-Some String #1234-56A ",
"4-Some String (Fubar/ #12-345-67A)",
"5-Some String (Fubar - #12-345.67 A)",
"6-Some String / #123",
"7-Some String/#0233",
"8-Some #1 String/#0233"
]
for test in test_cases:
    test = re.sub(r'[/|#][A-Z|a-z|0-9|-]*','', test)
    print(test)
The code should print:
1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some #1 String
But, instead I am currently getting this (with 4,5,8 not fully converted):
1-Some String
2-Some String
3-Some String
4-Some String (Fubar )
5-Some String (Fubar - .67 A)
6-Some String
7-Some String
8-Some String
Please try the following:
import re
test_cases = [
"1-Some String #0123",
"2-Some String #1234-56-a",
"3-Some String #1234-56A ",
"4-Some String (Fubar/ #12-345-67A)",
"5-Some String (Fubar - #12-345.67 A)",
"6-Some String / #123",
"7-Some String/#0233",
"8-Some #1 String/#0233"
]
for test in test_cases:
    test = re.sub(r'\s*([/#]|- )[\sA-Za-z0-9-#\.]*(?=(\)|$))','', test)
    print(test)
Result:
1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some #1 String
The regex (substring to delete) can be defined as:
To start with "/", "#" or "- "
May be preceded by whitespace(s)
To consist of whitespaces, alphanumerics, hyphens, hashes or dots
To be anchored by "end of line" or ")" by using a positive lookahead
Then the regex will look like:
\s*([/#]|- )[\sA-Za-z0-9-#\.]*(?=(\)|$))
The positive lookahead may require some explanation. The pattern (?=regex) is a zero-width assertion meaning "followed by regex". The benefit is that the matched substring does not include the text matched by the lookahead, so you can use it as an anchor.
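To illustrate the zero-width behaviour, the following sketch runs the regex from this answer on test case 4 and compares it with a hypothetical variant in which the lookahead is replaced by an ordinary consuming group (the variable names are only for illustration):

```python
import re

s = "4-Some String (Fubar/ #12-345-67A)"

# Lookahead: the ")" must follow, but it is not part of the match.
with_lookahead = re.sub(r'\s*([/#]|- )[\sA-Za-z0-9-#\.]*(?=(\)|$))', '', s)

# Consuming group: the ")" becomes part of the match and is deleted too.
consuming = re.sub(r'\s*([/#]|- )[\sA-Za-z0-9-#\.]*(?:\)|$)', '', s)

print(with_lookahead)  # closing parenthesis kept
print(consuming)       # closing parenthesis removed
```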
Another option is to match only the last occurrence of # using a negative lookahead (?![^#\n\r]*#). For clarity I have put matching a space [ ] between square brackets.
[ ]*(?:[/-][ ]*)?#(?![^#\n\r]*#)[\da-zA-Z. -]+
Explanation
[ ]* Match 0+ times a space
(?:[/-][ ]*)? Optionally match / or - and 0+ spaces
# Match literally
(?![^#\n\r]*#) Negative lookahead, assert that what is on the right does not contain #
[\da-zA-Z. -]+ Match 1+ times what is listed in the character class
In the replacement use an empty string.
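A short sketch of how this pattern could be applied with re.sub to the test cases from the question (the names pattern and results are only for illustration):

```python
import re

test_cases = [
    "1-Some String #0123",
    "2-Some String #1234-56-a",
    "3-Some String #1234-56A ",
    "4-Some String (Fubar/ #12-345-67A)",
    "5-Some String (Fubar - #12-345.67 A)",
    "6-Some String / #123",
    "7-Some String/#0233",
    "8-Some #1 String/#0233",
]

# Only the last "#" on a line can match, because the negative lookahead
# rejects any "#" that has another "#" somewhere to its right.
pattern = re.compile(r'[ ]*(?:[/-][ ]*)?#(?![^#\n\r]*#)[\da-zA-Z. -]+')

results = [pattern.sub('', test) for test in test_cases]
for result in results:
    print(result)
```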
It is probably easier to do it in two steps:
First: Clean up the part in parentheses. After the '(' and some letters, remove everything up to the closing ')'.
Second: Remove the unwanted stuff at the end of a line. A line ends either at '#' followed by 2 or more digits or a '/'. There may be a space before the '#' or '/'.
import re
paren_re = re.compile(r"([(][a-zA-Z]+)([^)]*)")
eol_re = re.compile(r"(.*?)\s*(?:#\d\d|/).*")
for line in test_cases:
    result = paren_re.sub(r"\1", line)
    result = eol_re.sub(r"\1", result)
    print(result)
I couldn't fit them into one regex, maybe someone can. Here's a 2-line solution:
import re
test_cases = [
"1-Some String #0123",
"2-Some String #1234-56-a",
"3-Some String #1234-56A ",
"4-Some String (Fubar/ #12-345-67A)",
"5-Some String (Fubar - #12-345.67 A)",
"6-Some String / #123",
"7-Some String/#0233",
"8-Some #1 String/#0233"
]
for test in test_cases:
    test = re.sub(r'[\/#][\w\s\d\-]*', '', test)
    test = re.sub(r'[\s\.\-\d]+\w+\)', ')', test)
    print(test)
Output:
1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some
Explanation:
\w for word characters (a-zA-Z, 0-9 and _)
\d for 0-9
\s for whitespace
\. for dot
\- for minus
But I'm confused by the last line of your expected output: why should it keep #1 String, and based on what rule? If you clarify that, a specific regex can be written for that pattern.
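For reference, here is a quick check of what these shorthand classes match in Python. Note that \w also covers digits and the underscore. The sample string is made up for illustration.

```python
import re

sample = 'aZ9_-. '

word_chars = re.findall(r'\w', sample)      # letters, digits and underscore
digits = re.findall(r'\d', sample)          # digits only
spaces = re.findall(r'\s', sample)          # whitespace only
dot_or_minus = re.findall(r'[.\-]', sample) # literal dot or minus

print(word_chars, digits, spaces, dot_or_minus)
```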

Match characters and whitespaces, but not numbers

I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.
Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
^(?:(?!\d)[\w ])+$
Note: This regex uses the m (multiline) and u (Unicode) flags; multiline is only necessary if the input is separated by newline characters.
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
(?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
[\w ] Match any word character or space (matches Unicode word characters with u flag)
$ Assert position at the end of the line
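The flags above are written in regex-tester notation. A rough Python equivalent, assuming re.MULTILINE stands in for the m flag and relying on Python 3 str patterns being Unicode-aware by default, could look like this:

```python
import re

# Each character must be a word character or a space, and must not be a digit.
rx = re.compile(r'^(?:(?!\d)[\w ])+$', re.MULTILINE)

text = "ÀÇÆ some words\nÀÇÆ some words 123"
matches = rx.findall(text)
print(matches)  # only the line without digits is returned

# Single strings can be checked the same way.
print(bool(rx.match('hell o')), bool(rx.match('hello 123')))
```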
You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']

Add [] around numbers in strings

I'd like to add [] around any sequence of numbers in a string, e.g.
"pixel1blue pin10off output2high foo9182bar"
should convert to
"pixel[1]blue pin[10]off output[2]high foo[9182]bar"
I feel there must be a simple way, but it's eluding me :(
Yes, there is a simple way, using re.sub():
result = re.sub(r'(\d+)', r'[\1]', inputstring)
Here \d matches a digit, \d+ matches 1 or more digits. The (...) around that pattern groups the match so we can refer to it in the second argument, the replacement pattern. That pattern simply replaces the matched digits with [...] around the group.
Note that I used r'..' raw string literals; if you don't you'd have to double all the \ backslashes; see the Backslash Plague section of the Python Regex HOWTO.
Demo:
>>> import re
>>> inputstring = "pixel1blue pin10off output2high foo9182bar"
>>> re.sub(r'(\d+)', r'[\1]', inputstring)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
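To make the raw-string note concrete, here is a small sketch comparing the raw-string call with the doubled-backslash form (the variable names raw_result and doubled_result are only for illustration):

```python
import re

inputstring = "pixel1blue pin10off output2high foo9182bar"

# With raw strings, backslashes reach the regex engine untouched.
raw_result = re.sub(r'(\d+)', r'[\1]', inputstring)

# Without raw strings, every backslash has to be doubled.
doubled_result = re.sub('(\\d+)', '[\\1]', inputstring)

print(raw_result == doubled_result)

# A single unescaped '\1' is not a backreference: Python reads it as
# an octal escape for the character with code 1.
print('\1' == '\x01')
```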
You can use re.sub :
>>> s="pixel1blue pin10off output2high foo9182bar"
>>> import re
>>> re.sub(r'(\d+)',r'[\1]',s)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
Here the (\d+) will match any combinations of digits and re.sub function will replace it with the first group match within brackets r'[\1]'.
You can start here to learn regular expression http://www.regular-expressions.info/

Python: How to replace string enclosed by certain characters

I want to replace a certain variable name in a mathematical expression while avoiding replacements inside function names.
For example, the following replacement of n:
"sin(2 pi*n d)" -> "sin(2 pi*REPL d)" but not "siREPL(2 pi*REPL d)"
My idea was to check whether the substring is enclosed by special characters (' ', '(', '*', etc.) on one side, but I failed to put it into a regex or Python code.
Any ideas?
Use a word boundary (\b):
>>> import re
>>> re.sub(r'\bn\b', 'REPL', 'sin(2 pi*n d)')
'sin(2 pi*REPL d)'
According to the re module documentation:
\b
Matches the empty string, but only at the beginning or end of a word.
A word is defined as a sequence of alphanumeric or underscore
characters, so the end of a word is indicated by whitespace or a
non-alphanumeric, non-underscore character. Note that formally, \b is
defined as the boundary between a \w and a \W character (or vice
versa), or between \w and the beginning/end of the string, ...
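A short sketch of how this definition plays out in practice; the second expression is a made-up input chosen to show that an underscore counts as a word character:

```python
import re

pattern = re.compile(r'\bn\b')

first = pattern.sub('REPL', 'sin(2 pi*n d)')
# "n_1" is left alone because "_" is a word character, so there is no
# boundary after the "n"; the "n" inside "tan" has no boundary before it.
second = pattern.sub('REPL', 'n + n_1 + tan(n)')

print(first)
print(second)
```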
