Python - regex cleanup for any brackets - python

My code needs to clean up the whitespace around brackets of any kind, so I assume using regex is my best course of action. My strings will (I think) always look like the following (although more robustness is always appreciated):
text = "the people ( that don't still like / love you } are going to ..."
to look like:
final = "the people (that don't still like / love you} are going to ..."
My current attempt seems to do nothing (I know it only considers round brackets for now):
final = re.sub( r'\s[\(]+\s(\w*)\s[\)]+\s' , '\s[\(]+\1[\)]+\s' , text )
Please & thank you.

In your example string, you want to remove spaces after the opening and before the closing bracket for not the same type of brackets.
The pattern that you tried does not work as there are multiple words between ( and ) and you are not matching the }
Note that in the character class you don't have to escape the parenthesis.
([{[(])\s*(.*?)\s*([]})])
Explanation
([{[(]) Capture group 1 match any of the listed brackets
\s* Match 0+ whitespace chars
(.*?) Capture group 2, match any char, as least as possible
\s* Match 0+ whitespace chars
([]})]) Capture group 3 match any of the listed brackets
See a regex demo
Replace with 3 capturing groups.
\1\2\3

Related

regex non greedy quantifier catching nothing, greedy catching too much

I'm writing a python regex formula that parses the content of a heading, however the greedy quantifier is not working well, and the non greedy quantifier is not working at all.
My string is
Step 1 Introduce The Assets:
Step2 Verifying the Assets
Step 3Making sure all the data is in the right place:
What I'm trying to do is extract the step number, and the heading, excluding the :.
Now I've tried multiple regex string and came up with these 2:
r1 = r"Step ?([0-9]+) ?(.*) ?:?"
r2 = r"Step ?([0-9]+) ?(.*?) ?:?"
r1 is capturing the step number, but is also capturing : at the end.
r2 is capturing the step number, and ''. I'm not sure how to handle the case where there is a .* followed by a string.
Necessary Edit:
The heading might contain : inside the string, I just want to ignore the trailing one. I know I can strip(':') but I want to understand what I'm doing wrong.
You can write the pattern using a negated character class without the non greedy and optional parts using a negated character class:
\bStep ?(\d+) ?([^:\n]+)
\bStep ? Match the word Step and optional space
(\d+) ? Capture 1+ digits in group 1 followed by matching an optional space
([^:\n]+) Capture 1+ chars other than : or a newline in group 2
Regex demo
If the colon has to be at the end of the string:
\bStep ?(\d+) ?([^:\n]+):?$
Regex demo

Regex match for non hyphenated words

I am trying to create a regex expression in Python for non-hyphenated words but I am unable to figure out the right syntax.
The requirements for the regex are:
It should not contain hyphens AND
It should contain atleast 1 number
The expressions that I tried are:=
^(?!.*-)
This matches all non-hyphenated words but I am not able to figure out how to additionally add the second condition.
^(?!.*-(?=/d{1,}))
I tried using double lookahead but I am not sure about the syntax to use for it. This matches ID101 but also matches STACKOVERFLOW
Sample Words Which Should Match:
1DRIVE , ID100 , W1RELESS
Sample Words Which Should Not Match:
Basically any non-numeric string (like STACK , OVERFLOW) or any hyphenated words (Test-11 , 24-hours)
Additional Info:
I am using library re and compiling the regex patterns and using re.search for matching.
Any assistance would be very helpful as I am new to regex matching and am stuck on this for quite a few hours.
Maybe,
(?!.*-)(?=.*\d)^.+$
might simply work OK.
Test
import re
string = '''
abc
abc1-
abc1
abc-abc1
'''
expression = r'(?m)(?!.*-)(?=.*\d)^.+$'
print(re.findall(expression, string))
Output
['abc1']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
RegEx 101 Explanation
/
(?!.*-)(?=.*\d)^.+$
/
gm
Negative Lookahead (?!.*-)
Assert that the Regex below does not match
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- matches the character - literally (case sensitive)
Positive Lookahead (?=.*\d)
Assert that the Regex below matches
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d matches a digit (equal to [0-9])
^ asserts position at start of a line
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
I came up with -
^[^-]*\d[^-]*$
so we need at LEAST one digit (\d)
We need the rest of the string to contain anything BUT a - ([^-])
We can have unlimited number of those characters, so [^-]*
but putting them together like [^-]*\d would fail on aaa3- because the - comes after a valid match- lets make sure no dashes can sneak in before or after our match ^[-]*\d$
Unfortunately that means that aaa555D fails. So we actually need to add the first group again- ^[^-]*\d[^-]$ --- which says start - any number of chars that aren't dashes - a digit - any number of chars that aren't dashes - end
Depending on style, we could also do ^([^-]*\d)+$ since the order of the digits/numbers dont matter, we can have as many of those as we want.
However, finally... this is how I would ACTUALLY solve this particular problem, since regexes may be powerful, but they tend to make the code harder to understand...
if ("-" not in text) and re.search("\d", text):

[FORKING]Python Regex - Re.Sub and Re.Findall Interesting Challenges

Not sure if this is something that should be a bounty. II just want to understand regex better.
I checked the responses in the Regex to match pattern.one skip newlines and characters until pattern.two and Regex to match if given text is not found and match as little as possible threads and read about Tempered Greedy Token Solutions and Explicit Greedy Alternation Solutions on RexEgg, but admittedly the explanations baffled me.
I spent the last day fiddling mainly with re.sub (and with findall) because re.sub's behaviour is odd to me.
.
Problem 1:
Given Strings below with characters followed by / how would I produce a SINGLE regex (using only either re.sub or re.findall) that uses alternating capture groups which must use [\S]+/ to get the desired output
>>> string_1 = 'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/'
>>> string_2 = 'variety.com/2017/biz/the/life/of/madam/green/news/tax-march-donald-trump-protest-1202031487/'
>>> string_3 = 'variety.com/2017/biz/the/life/of/news/tax-march-donald-trump-protest-1202031487/the/days/of/our/lives'
Desired Output Given the Conditions(!!)
tax-march-donald-trump-protest-
CONDITIONS: Must use alternating capture groups which must capture ([\S]+) or ([\S]+?)/ to capture the other groups but ignore them if they don't contain -
I'M WELL AWARE that it would be better to use re.findall('([\-]*(?:[^/]+?\-)+)[\d]+', string) or something similar but I want to know if I can use [\S]+ or ([\S]+) or ([\S]+?)/ and tell regex that if those are captured, ignore the result if it contains / or doesn't contain - While also having used an alternating capture group
I KNOW I don't need to use [\S]+ or ([\S]+) but I want to see if there is an extra directive I can use to make the regex reject some characters those two would normally capture.
Posted per request:
(?:(?!/)[\S])*-(?:(?!/)[\S])*
https://regex101.com/r/azrwjO/1
Explained
(?: # Optional group
(?! / ) # Not a forward slash ahead
[\S] # Not whitespace class
)* # End group, do 0 to many times
- # A dash must exist
(?: # Optional group, same as above
(?! / )
[\S]
)*
You could use
/([-a-z]+)-\d+
and take the first capturing group, see a demo on regex101.com.

Regex to ignore pattern found in quotes (Python or R)

I am trying to create a regex that allows me to find instances of a string where I have an unspaced /
eg:
some characters/morecharacters
I have come up with the expression below which allows me to find word characters or closing parenthesis before my / and word characters or open parenthesis characters afterwards.
(\w|\))/(\(|\w)
This works great for most situations, however I am coming unstuck when I have a / enclosed in quotes. In this case I'd like it to be ignored. I have seen a few different posts here and here. However, I can't quite get them to work in my situation.
What I'd like is for first three cases identified below to match and the last cast to be ignored allowing me to extract item 1 and item 3.
some text/more text
(formula)/dividethis
divideme/(byme)
"dont match/me"
It ain't pretty, but this will do what you want:
(?<!")(?:\(|\b)[^"\n]+\/[^"\n]+(?:\)|\b)(?!")
Demo on Regex101
Let's break it down a bit:
(?<!")(?:\(|\b) will match either an open bracket or a word boundary, as long as it's not preceded by a quotation mark. It does this by employing a negative lookbehind.
[^"\n]+ will match one or more characters, as long as they're neither a quotation mark or a line break (\n).
\/ will match a literal slash character.
Finally, (?:\)|\b)(?!") will match either a closing bracket or a word boundary as long as it's not followed by a quotation mark. It does this by employing a negative lookahead. Note that the (?:\)|\b) will only work 100% correctly in this order - if you reverse them, it'll drop the match on the bracket, because it encounters a word boundary before it gets to the bracket.
This will only match word/word which is not inside quotation marks.
import re
text = """
some text/more text "dont match/me" divideme/(byme)
(formula)/dividethis
divideme/(byme) "dont match/me hel d/b lo a/b" divideme/(byme)
"dont match/me"
"""
groups=re.findall("(?:\".*?\")|(\S+/\S+)", text, flags=re.MULTILINE)
print filter(None,groups)
Output:
['text/more', 'divideme/(byme)', '(formula)/dividethis', 'divideme/(byme)', 'divideme/(byme)']
(?:\".*?\") This will match everything inside quotes but this group won't be captured.
(\S+/\S+) This will match word/word only outside the quotations and this group will be captured.
Demo on Regex101

Find first ReGex pattern following a different pattern

Objective: find a second pattern and consider it a match only if it is the first time the pattern was seen following a different pattern.
Background:
I am using Python-2.7 Regex
I have a specific Regex match that I am having trouble with. I am trying to get the text between the square brackets in the following sample.
Sample comments:
[98 g/m2 Ctrl (No IP) 95 min 340oC ]
[ ]
I need the line:
98 g/m2 Ctrl (No IP) 95 min 340oC
The problem is the undetermined number of white-spaces, tabs, and new-lines between the search pattern Sample comments: and the match I want is giving me trouble.
Best Attempt:
I am able to match the first part easily,
match = re.findall(r'Sample comments:[.+\n+]+', string)
But I can't get the match to the length I want to grab the portion between the square brackets,
match = re.findall(r'Sample comments:[.+\n+]+\[(.+)\]', string)
My Thinking:
Is there a way to use ReGex to find the first instance of the pattern \[(.+)\] after a match of the pattern Sample comments:? Or is there a more robust way to find the bit between the square braces in my example case.
Thanks,
Michael
I suggest using
r'Sample comments:\s*\[(.*?)\s*]'
See the regex and IDEONE demo
The point is the \s* matches zero or more whitespace, both vertical (linebreaks) and horizontal. See Python re reference:
\s
When the UNICODE flag is not specified, it matches any whitespace character, this is equivalent to the set [ \t\n\r\f\v]. The LOCALE flag has no extra effect on matching of the space. If UNICODE is set, this will match the characters [ \t\n\r\f\v] plus whatever is classified as space in the Unicode character properties database.
Pattern details:
Sample comments: - a sequence of literal chars
\s* - 0 or more whitespaces
\[ - a literal [
(.*?) - Group 1 (returned by re.findall) capturing 0+ any chars but a newline as few as possible up to the first...
\s* - 0+ whitespaces and
] - a literal ] (note it does not have to be escaped outside the character class).
Not sure if I understand your problem correctly, but re.findall('Sample comments:[^\\[]*\\[([^\\]]*)\\]', string) seems to work.
Or maybe re.findall('Sample comments:[^\\[]*\\[[ \t]*([^\\]]*?)[ \t]*\\]', string) if you want to strip the final spaces from your line?

Categories

Resources