Why does this string match the pattern?
pattern = """
^Page \d of \d$|
^Group \d Notes$|
^More word lists and tips at http://wwwmajortests.com/word-lists$|
"""
re.match(pattern, "stackoverflow", re.VERBOSE)
I expected it to match only strings like "Page 1 of 1" or "Group 1 Notes".
Your regular expression has a trailing |:
# ^More word lists and tips at http://wwwmajortests.com/word-lists$|
# ^
The trailing | leaves an empty alternative, and an empty pattern matches any string:
>>> import re
>>> re.match('abc|', 'abc')
<_sre.SRE_Match object at 0x7fc63f3ff3d8>
>>> re.match('abc|', 'bbbb')
<_sre.SRE_Match object at 0x7fc63f3ff440>
So, remove the trailing |.
BTW, you don't need ^ because re.match only checks for a match at the beginning of the string.
I also recommend using raw strings (r'...') so that backslashes reach the regex engine unchanged.
ADDITIONAL NOTE
\d matches only a single digit. Use \d+ if you also want to match multiple digits.
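A minimal sketch of the fix, assuming the goal is to match whole strings such as "Page 1 of 12". The names broken and fixed are illustrative and not from the question, and only the first two alternatives are kept for brevity. One extra point is assumed relevant here: under re.VERBOSE, unescaped whitespace in the pattern is ignored, so literal spaces are written as [ ].

```python
import re

# Broken: the trailing | adds an empty alternative that matches anything.
broken = re.compile(r"""
    ^Page[ ]\d+[ ]of[ ]\d+$ |
    ^Group[ ]\d+[ ]Notes$   |
""", re.VERBOSE)

# Fixed: no trailing |, raw string, \d+ for multi-digit numbers.
# Assumption: spaces are written as [ ] because re.VERBOSE ignores
# unescaped whitespace in the pattern.
fixed = re.compile(r"""
    ^Page[ ]\d+[ ]of[ ]\d+$ |
    ^Group[ ]\d+[ ]Notes$
""", re.VERBOSE)

for text in ["stackoverflow", "Page 1 of 12", "Group 3 Notes"]:
    # Print whether each version of the pattern matches the text.
    print(text, bool(broken.match(text)), bool(fixed.match(text)))
```

With the trailing | removed, "stackoverflow" should no longer match, while the two intended formats still do.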
I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code I tried is below. On the regexr site, the regular expression matches the parts I want but gives a warning. I'm not an expert in regex, so I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))
Using [\w \w]* is the same as [\w ]* and also matches an empty string.
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat matching spaces and 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']
For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', str)
However, this does not work for the string defined in the OP's code, which contains a question mark; that string is inconsistent with the two examples given in the question.
The regular expression can be broken down as follows.
(?<=\)) # positive lookbehind asserts that location in the
# string is preceded by ')'
[ ]+ # match one or more spaces
| # or
[ ]+ # match one or more spaces
(?=\() # positive lookahead asserts that location in the
# string is followed by '('
In the above I've put each of the two space characters in a character class merely to make them visible.
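As a quick check of the lookaround-based split, here is a small sketch run on the two example strings from the question. The names splitter, examples and results are illustrative only.

```python
import re

# Split on runs of spaces that either follow ')' or precede '('.
splitter = re.compile(r"(?<=\)) +| +(?=\()")

examples = ["(so) what (are you trying to say)", "what (do you mean)"]
results = [splitter.split(text) for text in examples]

for parts in results:
    print(parts)
```

Because the lookarounds are zero-width, the parentheses stay attached to their groups and only the spaces are consumed by the split.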
I am new to regex and have a regex replacement in a re.sub that I can't figure out.
import re
test_cases = [
"1-Some String #0123",
"2-Some String #1234-56-a",
"3-Some String #1234-56A ",
"4-Some String (Fubar/ #12-345-67A)",
"5-Some String (Fubar - #12-345.67 A)",
"6-Some String / #123",
"7-Some String/#0233",
"8-Some #1 String/#0233"
]
for test in test_cases:
test = re.sub(r'[/|#][A-Z|a-z|0-9|-]*','', test)
print(test)
The code should print:
1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some #1 String
But instead I am currently getting this (with 4, 5 and 8 not converted correctly):
1-Some String
2-Some String
3-Some String
4-Some String (Fubar )
5-Some String (Fubar - .67 A)
6-Some String
7-Some String
8-Some String
Please try the following:
import re
test_cases = [
"1-Some String #0123",
"2-Some String #1234-56-a",
"3-Some String #1234-56A ",
"4-Some String (Fubar/ #12-345-67A)",
"5-Some String (Fubar - #12-345.67 A)",
"6-Some String / #123",
"7-Some String/#0233",
"8-Some #1 String/#0233"
]
for test in test_cases:
test = re.sub(r'\s*([/#]|- )[\sA-Za-z0-9-#\.]*(?=(\)|$))','', test)
print(test)
Result:
1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some #1 String
The regex (substring to delete) can be defined as:
To start with "/", "#" or "- "
May be preceded by whitespace(s)
To consist of whitespaces, alphanumerics, hyphens, hashes or dots
To be anchored by "end of line" or ")" by using a positive lookahead
Then the regex will look like:
\s*([/#]|- )[\sA-Za-z0-9-#\.]*(?=(\)|$))
The positive lookahead may require some explanation. The pattern (?=regex)
is a zero-width assertion meaning "followed by regex".
The benefit is that the matched substring does not include the text matched by
the lookahead, so you can use it as an anchor.
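To illustrate the zero-width behaviour, here is a small sketch using a simplified pattern (not the full regex above) on a made-up input string. The closing parenthesis must be present for the match to succeed, but it is not part of the match, so it survives the substitution.

```python
import re

text = "(Fubar #12)"  # hypothetical input for illustration

# (?=\)) requires a ')' right after the digits but does not consume it.
pattern = re.compile(r"\s*#\d+(?=\))")

matched = pattern.search(text).group(0)
cleaned = pattern.sub("", text)

print(repr(matched))
print(cleaned)
```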
Another option is to match only the last occurrence of # using a negative lookahead (?![^#\n\r]*#). For clarity, each literal space is written as [ ].
[ ]*(?:[/-][ ]*)?#(?![^#\n\r]*#)[\da-zA-Z. -]+
Explanation
[ ]* Match 0+ times a space
(?:[/-][ ]*)? Optionally match / or - and 0+ spaces
# Match literally
(?![^#\n\r]*#) Negative lookahead, assert that what is on the right does not contain #
[\da-zA-Z. -]+ Match 1+ times what is listed in the character class
In the replacement use an empty string.
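A short sketch applying this pattern to the test cases from the question, with an empty replacement string. The names pattern and cleaned are illustrative.

```python
import re

# Match only the last '#' on the line, plus an optional leading '/' or '-'.
pattern = re.compile(r"[ ]*(?:[/-][ ]*)?#(?![^#\n\r]*#)[\da-zA-Z. -]+")

test_cases = [
    "1-Some String #0123",
    "2-Some String #1234-56-a",
    "3-Some String #1234-56A ",
    "4-Some String (Fubar/ #12-345-67A)",
    "5-Some String (Fubar - #12-345.67 A)",
    "6-Some String / #123",
    "7-Some String/#0233",
    "8-Some #1 String/#0233",
]

cleaned = [pattern.sub("", test) for test in test_cases]
for line in cleaned:
    print(line)
```

In case 8 the first # is skipped because another # follows it on the same line, so only the trailing /#0233 is removed.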
It is probably easier to do it in two steps:
First: Clean up the part in parentheses. After the '(' and some letters, remove everything up to the closing ')'.
Second: Remove the unwanted stuff at the end of a line. A line ends either at '#' followed by 2 or more digits or a '/'. There may be a space before the '#' or '/'.
import re

# test_cases is the list defined in the question
paren_re = re.compile(r"([(][a-zA-Z]+)([^)]*)")
eol_re = re.compile(r"(.*?)\s*(?:#\d\d|/).*")
for line in test_cases:
result = paren_re.sub(r"\1", line)
result = eol_re.sub(r"\1", result)
print(result)
I couldn't fit them into one regex; maybe someone else can. Here's a two-line solution:
import re
test_cases = [
"1-Some String #0123",
"2-Some String #1234-56-a",
"3-Some String #1234-56A ",
"4-Some String (Fubar/ #12-345-67A)",
"5-Some String (Fubar - #12-345.67 A)",
"6-Some String / #123",
"7-Some String/#0233",
"8-Some #1 String/#0233"
]
for test in test_cases:
test = re.sub(r'[\/#][\w\s\d\-]*', '', test)
test = re.sub(r'[\s\.\-\d]+\w+\)', ')', test)
print(test)
Output:
1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some
Explanation:
\w for word characters (a-zA-Z, 0-9 and _)
\d for 0-9
\s for whitespace
\. for a dot
\- for a minus sign
But I'm confused by the last line of your expected output: why should it keep #1 String, and based on what rule? If you can state the rule, a specific regex can be written for that pattern.
I am writing a Python script to find a tag name in a string like this:
string='Tag Name =LIC100 State =TRUE'
If I use an expression like this
re.search('Name(.*)State',string)
I get " =LIC100". I would like to get just LIC100.
Any suggestions on how to set up the pattern to eliminate the whitespace and the equals sign?
That is because (.*) matches 0+ chars other than line break chars, from Name up to the last State. You may restrict the pattern in Group 1 to just non-whitespace characters:
import re
string='Tag Name =LIC100 State =TRUE'
m = re.search(r'Name\s*=(\S*)',string)
if m:
print(m.group(1))
Pattern details:
Name - a literal char sequence
\s* - 0+ whitespaces
= - a literal =
(\S*) - Group 1 capturing 0+ chars other than whitespace (or \S+ can be used to match 1 or more chars other than whitespace).
The easiest solution would probably just be to strip it out after the fact, like so:
s = " =LIC100 "
s = s.strip('= ')
print(s)
#LIC100
If you insist on doing it within the regex, you can try something like:
reg = r'Name[ =]+([A-Za-z0-9]+)\s+State'
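A minimal usage sketch for that pattern, assuming the input string from the question. The variable names s, m and tag are illustrative.

```python
import re

s = 'Tag Name =LIC100 State =TRUE'
reg = r'Name[ =]+([A-Za-z0-9]+)\s+State'

# [ =]+ consumes the space and '=' so they stay out of group 1.
m = re.search(reg, s)
tag = m.group(1) if m else None
print(tag)
```

Checking m before calling group(1) avoids an AttributeError when the input does not contain the expected fields.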
Your current regex fails because (.*) captures every character up to the occurrence of State. Instead of capturing everything, you can use a positive lookbehind to describe what precedes, but is not included in, the content you actually want to capture. In this case, "Name =" precedes the match, so we can put it in the lookbehind assertion as (?<=Name =), then capture word characters up to the next whitespace:
>>> import re
>>> s = 'Tag Name =LIC100 State =TRUE'
>>> r = re.compile(r"(?<=Name =)\w*")
>>> print(r.search(s))
<_sre.SRE_Match object; span=(10, 16), match='LIC100'>
>>> print(r.search(s).group(0))
LIC100
Following the tips above, I managed to find a nice solution.
Actually, the string I am trying to process has some non-printable characters. It is like this
"Tag Name\x00=LIC100\x00\tState=TRUE"
Using the concept of lookahead and lookbehind I found the following solution:
import re
s = 'Tag Name\x00=LIC100\x00\tState=TRUE'
T=re.search(r'(?<=Name\x00=)(.*)(?=\x00\tState)',s)
print(T.group(0))
The nice thing about this is that the result does not contain any non-printable characters.
<_sre.SRE_Match object; span=(10, 16), match='LIC100'>
I'd like to add [] around any sequence of digits in a string, e.g.
"pixel1blue pin10off output2high foo9182bar"
should convert to
"pixel[1]blue pin[10]off output[2]high foo[9182]bar"
I feel there must be a simple way but it's eluding me :(
Yes, there is a simple way, using re.sub():
result = re.sub(r'(\d+)', r'[\1]', inputstring)
Here \d matches a digit, \d+ matches 1 or more digits. The (...) around that pattern groups the match so we can refer to it in the second argument, the replacement pattern. That pattern simply replaces the matched digits with [...] around the group.
Note that I used r'..' raw string literals; if you don't you'd have to double all the \ backslashes; see the Backslash Plague section of the Python Regex HOWTO.
Demo:
>>> import re
>>> inputstring = "pixel1blue pin10off output2high foo9182bar"
>>> re.sub(r'(\d+)', r'[\1]', inputstring)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
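To show why the raw string matters for the replacement, here is a small sketch comparing three spellings on a shortened input. The variable names are illustrative. In a normal (non-raw) string literal, Python reads "\1" as an octal escape for the character with code 1 before re.sub ever sees it.

```python
import re

text = "pin10off"

# Raw strings: the backslashes reach the regex engine unchanged.
raw = re.sub(r"(\d+)", r"[\1]", text)

# Normal strings work too, but every backslash has to be doubled.
doubled = re.sub("(\\d+)", "[\\1]", text)

# Without r'' or doubling, "\1" is the control character chr(1),
# so the digits are replaced by that character instead of group 1.
not_raw = re.sub(r"(\d+)", "[\1]", text)

print(raw)
print(doubled)
print(repr(not_raw))
```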
You can use re.sub :
>>> s="pixel1blue pin10off output2high foo9182bar"
>>> import re
>>> re.sub(r'(\d+)',r'[\1]',s)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
Here (\d+) matches any run of one or more digits, and re.sub replaces it with the first group's match wrapped in brackets, r'[\1]'.
A good place to start learning regular expressions is http://www.regular-expressions.info/
I'm using ^category/|categories/$.
Why does ^categor[y|ies]/$ not work?
Your regex should be:
^categor(?:y|ies)/$
Use a non-capturing group instead of a character class.
Most special characters inside a character class are treated as literals, with a few exceptions. [y|ies] matches a single character: y, |, i, e or s.
>>> import re
>>> text = """
... category/
... categories/
... categories
... category"""
>>> m = re.findall(r'^categor(?:y|ies)/$', text, re.MULTILINE)
>>> m
['category/', 'categories/']
Explanation:
^ Asserts that we are at the beginning of the line.
categor Matches the string categor.
(?:y|ies) The string categor must be followed by y or ies. In regex, (?:...) is called a non-capturing group: it groups the alternatives for matching without capturing the text.
/ Matches a literal forward slash /.
$ End of the line.
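The difference between the character class and the group can be seen side by side in this small sketch. The candidate strings and variable names are made up for illustration.

```python
import re

char_class = re.compile(r"^categor[y|ies]/$")  # one char out of: y | i e s
group = re.compile(r"^categor(?:y|ies)/$")     # either "y" or "ies"

candidates = ["category/", "categories/", "categori/", "categor|/"]
results = {
    c: (bool(char_class.match(c)), bool(group.match(c))) for c in candidates
}

for candidate, (by_class, by_group) in results.items():
    print(candidate, by_class, by_group)
```

The character class accepts odd inputs such as "categori/" and "categor|/" and rejects "categories/", which is why the original pattern did not behave as intended.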